ABSTRACT


In the context of multiple linear regression, when a
subset of k-out-of-p predictor variables is to be selected
for the purpose of predicting the response at some known
point in the predictor variables' space, the width of the
resulting prediction interval gives an indication of the
precision with which the response is predicted and, thus, it
may provide a suitable selection criterion.

A review of commonly used selection criteria is given,
with special emphasis on those which deal with the problem
of prediction. The Mahalanobis distance is one of the
quantities affecting the width of the prediction interval, and
it is studied in some detail. The effects of adding a new
variable to a model are investigated and a monotonicity
theorem is derived.

The influence of an observation on the width of the
prediction interval, as measured by the effected change when
that observation is set aside, is also investigated and an
equivalence between observation deletions and variable
augmentation is shown.

The relationships found in these investigations indicate
the applicability of certain computing techniques.
Computing algorithms are presented.




A  management  science  application  of  the  statistical 
procedures  developed  in  this  study  is  explored  in  the  area 
of  parametric  cost  estimation. 
CHAPTER I


INTRODUCTION 

Multiple regression analysis is, probably, the most
widely used and abused of all statistical tools (5). Many
authors attest to the importance and wide applicability of
this technique (5), (9), (28), etc. The advent of high-speed
digital computers and the development of efficient
software packages have made it accessible to users from all
fields of research. Associated with the enhanced availability
and ease of use provided by these technological developments
is a tendency to apply regression techniques routinely
and mechanically, without due consideration to the underlying
theory or the empirical "rules of thumb" consistent with
that theory and with common sense.

The main step in any regression analysis is the development
of an equation relating one variable, commonly referred
to as the response variable, to another set of variables,
called explanatory or predictor variables. For some
highly structured applications in the physical sciences, the
exact form of the appropriate regression function may be
known to the experimenter. In other cases, theory may specify
a functional form to be tested. These cases, however,
are the exception rather than the rule. More often, the
analyst is uncertain about which variables are important
carriers of information, as well as about the form of the
relation. In those cases, the analyst must let the data
speak for itself in suggesting candidate model specifications.
This process is referred to as "data mining" by
Leamer (22), and is more formally known as empirical model
building. Usually, at an early stage, a large number of
potential predictor variables must be considered, some of
which may be transformations of other variables. The task
of the analyst is to bring to the surface the "nuggets of
truth" which are hidden in a set of observations on the
variables by means of a thorough and appropriate investigation.
There are many reasons why one must be parsimonious
in the use of variables. Some of them may be totally
irrelevant to the problem, while others may be "conditionally
irrelevant" in the sense that, in the presence of other
variables, they possess little or no explanatory value. It is
tempting to use "all the information" of the "full" model
but this often causes problems associated with what is
referred to as "overfitting". Models with many variables
result in large prediction variances (35) as well as statistical
and computational instability in the presence of
multicollinearity among the retained variables (3). Also
important is the fact that a model with many variables may
be difficult to interpret and/or maintain. Thus, the need
arises for techniques which will screen the variables and
select a subset of them deemed appropriate for the intended
use of the model.
Various techniques, commonly referred to as "variable
selection criteria", have been suggested for this purpose,
such as minimizing the mean square error (MSE) or, equivalently,
maximizing the adjusted coefficient of determination,
$R_a^2$, maximizing F, minimizing Mallows' $C_k$ statistic
etc. In Chapter II, the need for variable selection and
some of the criteria in use are discussed. All of these
commonly used techniques are based on the data only through
the sum of squared errors (SSE). As a result, for any given
number of variables, they all select the same subset,
namely, the one which minimizes SSE. This, in itself, is a
rather desirable property, especially when the object of the
analysis is the explanation of relations among the historical
data. However, as Lindley (23) emphasizes, the technique
used to develop a regression equation ought to be related
to the intended use. When the object of the analysis is
the development of an equation which will be used in order
to predict the response at a known point in the space of the
predictor variables, ignoring the information carried by
that point is contrary to Lindley's recommendation and to
common sense. The issue of how to use such information needs,
therefore, to be investigated. The Mean Square Error of
Prediction (MSEP) criterion, which is discussed in the next
chapter, represents an attempt towards utilizing the
information carried by the point under prediction. Its
approach, however, does not seem to be fully satisfactory for
several reasons which will be discussed in subsequent
chapters. Therefore, there remains a void in the literature in
this respect, which this dissertation attempts to fill.

More specifically, the problem can be described as follows:
A future observation on the response variable, Y,
must be predicted, using the relational information provided
by a set of n historical observations on Y and a set of p
predictor variables $X_1, X_2, \ldots, X_p$ potentially related to it,
as well as the values x of the predictor variables associated
with that future observation. The relative location of
x with respect to the historical data yields additional
information which, if ignored, may lead to models not well
suited to predict at that location.

The width of the resulting prediction interval at x is
a numeraire which seems like a reasonable basis for choosing
among alternative models. The Mahalanobis distance, introduced
by P. C. Mahalanobis (24) as a measure of divergence
between groups of multivariate data, affects the width of
the prediction interval and may provide a measure of analogy
between x and the historical data. In Chapter III, the
theoretical aspects of the problem are investigated. An
interesting result which leads both to an easy geometric
interpretation and to existing computing techniques is derived
in the form of a monotonicity theorem. This theorem is also
used in order to explain certain observations made during
the simulations which were conducted and the analyses which
were performed on real data sets.
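
For orientation, the connection between the interval width and the Mahalanobis distance can be written in the standard textbook form below. This is a sketch under assumptions: the symbols $s$, $D^2$, $\bar{x}$ and $S$ are introduced here for illustration and are not taken from this chapter, whose own derivation is deferred to Chapter III. It applies to a k-variable model with an intercept and normal errors.

```latex
% Assumed textbook form of the 100(1-\alpha)% prediction interval at x
\hat{y} \;\pm\; t_{\alpha/2,\,n-k-1}\; s\,\sqrt{1 + \frac{1}{n} + \frac{D^2}{n-1}},
\qquad
D^2 = (x - \bar{x})'\, S^{-1}\, (x - \bar{x}),
\qquad
s^2 = \frac{\mathrm{SSE}}{n-k-1}
```

Here $\bar{x}$ is the vector of historical means of the k selected predictors and $S$ is their sample covariance matrix (divisor n-1). The interval widens when $s$ grows and when x lies far, in the Mahalanobis sense, from the centroid of the historical data.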
The computational aspect of the problem as it relates
to  the  proposed  selection  technique  is  investigated  in 
Chapter  IV.  This  is  an  important  consideration  because  the 
need  for  variable  selection  becomes  more  pronounced  as  the 
number  of  potential  predictor  variables  increases.  An 
existing,  efficient  algorithm  is  modified  for  the  purposes 
of  this  criterion,  by  utilizing  the  results  of  the  theorem 
in  Chapter  III.  Using  the  same  theorem,  stepwise  FORWARD 
selection  and  BACKWARD  elimination  algorithms  are 
developed. 

The leverage of individual observations on the various
quantities of interest should be an integral part of any
analysis and has recently received deserved attention in the
literature (7), (8), (13), (18), (36). Observations which
seem discrepant or damaging in some sense appropriate to the
analysis are allotted special attention and are investigated
further. In the context of this investigation, an observation
calls for such attention if its deletion from the
least squares calculations results in a significant change
in the width of the prediction interval. Chapter V deals
with this issue. Some results are derived, some observations
are made and computational formulae are given.

In Chapter VI, an application from the field of management
science, referred to as parametric cost estimation, is
investigated. A real data set is analyzed and the performance
of this criterion is compared to that of others. The
results of a limited simulation study are also briefly
discussed.

Finally,  in  Chapter  VII,  some  concluding  remarks  are 
made,  suggestions  for  the  use  of  the  new  criterion  are  given 
and  questions  relevant  to  the  problem  at  hand  which  were  not 
investigated  in  this  study  are  raised. 

A word of warning is appropriate here, which applies to
every statistical analysis of data and, in particular, to
every variable selection technique. As was mentioned above,
regression is one of the most widely used statistical
techniques because of its wide applicability, ease of use and
elegance. It is also one of the most widely abused
techniques. Two of the reasons for such abuse are:

(a)  The  proliferation  of  efficient  statistical  packages 
with  a  variety  of  regression  options. 

(b)  The  lack  of  awareness  on  the  part  of  the  practitioner 
about  the  dangers  of  such  misuse. 

It  may  be  that  the  practitioner  has  not  been  warned  by  the 
statistician  emphatically  or  frequently  enough.  However,  it 
remains  of  paramount  importance  that  the  practitioner  be 
aware  of  the  following: 

There is no substitute for a well thought out, well executed
and complete analysis. There are many sides to an analysis
and data sets behave in their own peculiar ways. Standard,
mechanical approaches often fail to reveal these peculiarities
and, even if they do, appropriate remedial action
requires more than superficial familiarity with the model
and its relation to the real world process. For example,
there is no variable selection technique which is automatically
applicable to all situations. Even for a given data
set, there is rarely a "best" criterion or a "best" model
that is known to the analyst. For a good analysis, potential
variables and candidate model specifications should be
decided after "eyeballing" the data, and with input from the
experts in the field of application. Part of the data
should be set aside for validation purposes, whenever such a
luxury can be afforded. Models that have good "automatic"
properties, such as large $R^2$, small $C_k$, narrow prediction
intervals etc., should be kept for further scrutiny. If
probabilistic statements are to be made, which is almost always
the case, the residuals should be analyzed for indications of
model inadequacy and of violations of the assumptions on the
errors. Finally, the model (or models) passing all tests
should be subjected to the criticisms of the experts in the
field. In the process described above, only the computations
may be done in an automatic way. The analyst's judgement
and knowledge put to good use is what constitutes the
difference between data analysis and the simple processing
of data.


CHAPTER  II 


ON  THE  SELECTION  OF  VARIABLES 

The  Need  for  Variable  Selection 
In most practical situations, finding an equation which
will describe a set of data collected in a manner referred
to by Box (5) as an "unplanned experiment" is a difficult
task. The problem which is investigated in this dissertation
can be described as follows:

There are available n observations (fundamental measurements)
on one response variable, V, denoted by $V_i$,
$i = 1, 2, \ldots, n$, and n associated observations on m basic, or
fundamental, variables $Z_1, Z_2, \ldots, Z_m$, denoted by
$Z_{i1}, Z_{i2}, \ldots, Z_{im}$, $i = 1, 2, \ldots, n$. There is one more
measurement $z_1, z_2, \ldots, z_m$ on the basic variables. An equation of
the form

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i \tag{2.1}$$


is assumed relating $Y_i = f(V_i)$ and $X_{ij} = g_j(Z_{i1}, \ldots, Z_{im})$,
$j = 1, 2, \ldots, p$, $i = 1, 2, \ldots, n$. Henceforth, the variables $X_j$,
$j = 1, 2, \ldots, p$ will be referred to as explanatory, or predictor
variables. This equation is assumed to be linear in the
parameters $\beta_0, \beta_1, \ldots, \beta_p$ and it need not be linear in the
original variables $Z_1, Z_2, \ldots, Z_m$ as $g_j$, $j = 1, 2, \ldots, p$ may be
any functions of those variables. For example, $X_1 = Z_1^2$,
$X_2 = Z_2 Z_3$, $X_3 = \log Z_4$ etc., will produce an equation which is
not linear in the basic variables $Z_1, Z_2, \ldots, Z_m$. In matrix
terms, the model can be described by $Y = X\beta + \varepsilon$, where Y is
the n×1 vector of responses, X is the n×(p+1) matrix of the
values of the explanatory variables whose first column consists
of 1's and which is assumed to be of full column rank,
and $\beta$ is the (p+1)-dimensional vector of the unknown parameters.
The object of most statistical analyses is to estimate
the parameters $\beta = [\beta_0, \beta_1, \ldots, \beta_p]'$ by means of
$b = [b_0, b_1, \ldots, b_p]'$. The estimate b is usually obtained by
the method of least squares, i.e. $b_0, b_1, \ldots, b_p$ are such
that

$$Y_i = b_0 + b_1 X_{i1} + \cdots + b_p X_{ip} + e_i \tag{2.2}$$

with

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{i1} - \cdots - b_p X_{ip})^2 = \min. \tag{2.3}$$


In this investigation, the case where the resulting equation
will be used to predict the response y associated with the
point $x = [x_1, x_2, \ldots, x_p]'$, where $x_j = g_j(z_1, \ldots, z_m)$,
$j = 1, 2, \ldots, p$, is considered. In these cases, the main
consideration is the accurate prediction of y rather than the
estimation of $\beta$. For curve fitting purposes, the n-dimensional
vectors $X_1, X_2, \ldots, X_p$ are assumed independent in the algebraic
sense. In order to make statistical inferences about the
standard errors of the estimates, the precision of the predicted
values etc., the errors are, in addition, assumed
to be jointly distributed as $N(0, \sigma^2 I)$, i.e. normal, with
mean vector zero and covariance matrix $\sigma^2 I$. This covariance
structure implies that the errors have equal variances
(homoscedastic) and are uncorrelated.
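
The setup above can be illustrated with a minimal sketch, assuming a small made-up data set: the basic variables, the transformations $g_1 = Z_1^2$ and $g_2 = \log Z_2$, the helper names `solve` and `fit`, and the numbers are all hypothetical. It shows a model that is linear in the parameters but not in the basic variables, fitted by solving the normal equations X'X b = X'Y. The response is generated without noise only so that the recovered coefficients can be checked.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Least squares coefficients [b0, b1, ..., bp] from the normal equations X'X b = X'Y."""
    X = [[1.0] + list(r) for r in rows]  # first column consists of 1's
    cols = range(len(X[0]))
    xtx = [[sum(x[i] * x[j] for x in X) for j in cols] for i in cols]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in cols]
    return solve(xtx, xty)

# Hypothetical basic variables (Z1, Z2) and transformations X1 = Z1^2, X2 = log Z2.
Z = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0), (5.0, 4.0), (6.0, 7.0)]
rows = [(z1 ** 2, math.log(z2)) for z1, z2 in Z]
Y = [2.0 + 3.0 * x1 - 1.5 * x2 for x1, x2 in rows]  # noise-free, for checking only

b = fit(rows, Y)
z_new = (3.5, 3.0)                                   # the point under prediction
x_new = (z_new[0] ** 2, math.log(z_new[1]))
y_hat = b[0] + b[1] * x_new[0] + b[2] * x_new[1]
print(b, y_hat)
```

Solving the normal equations directly is adequate for a small, well conditioned illustration; with strongly associated predictors a numerically more careful method would be preferable.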

The initial choice of what aspects of the sample units
ought to be measured may be straightforward, as is the case
in well understood situations where physical laws and theories
apply or are being tested. In less structured situations,
such as behavioral research, exploratory studies
often start by measuring almost everything and then let the
data "speak for itself" in identifying the important variables
and forms of relations. Whatever process is used in
assembling the set of p candidate predictors, $X_1, X_2, \ldots, X_p$,
it is hoped that the list is extensive enough to include all
of those which have influence on the response variable Y.

To be so inclusive, the list often contains useless variables
and/or variables whose informational value is superfluous
in the presence of other explanatory variables. As part
of the more general problem of analyzing a given set of
data, subsets of variables must be selected, which seem to
explain the data adequately. Selecting the essential variables
is a source of trouble with unplanned data. One major
reason for this is that the problem does not yield to a
universal definition. What is precisely meant by saying that a
model is sought which "adequately represents the data"?
Which facet of the data should the analyst ask the chosen
model to represent best? The answer to such questions must
depend on the intended use of the model as discussed by
Lindley (23). The idea of a model which is "best" for
prediction, or a model which is "best" for estimation, for
instance, is elusive. Indeed, several answers to such questions
might be appropriate as the problem is not one but
several, intricately interwoven. It is a generally accepted
maxim among statisticians, however, that parsimony in model
building is desirable. There are several theoretical and
practical reasons for this view, as follows:

1. Models with too many variables usually result in
large prediction variances due to the fact that many
parameters have to be estimated. Walls and Weeks (35) have
shown that the variance of prediction increases with the
number of variables in the regression equation. For this
reason, the analyst would like to detect and exclude those
variables which are either irrelevant to the problem or
unnecessary in the presence of others which are to be retained.

2. With a large number of variables, statistical
instability of the resulting equation is more likely to occur.
Statistical instability is the phenomenon in which a small
perturbation in the values of some of the variables results
in a large change in the coefficients of the fitted equation.
This is one of the visible effects of multicollinearity,
namely the phenomenon of strong association among the
retained variables. Mathematically, this phenomenon occurs
when the matrix $X'X$ is nearly degenerate. The phenomenon of
multicollinearity can appear because the data come from a
subspace of the true sample space, one that can almost be
described in fewer dimensions. This may happen either by
chance, or by necessity, or by the inclusion of extraneous
variables which are strongly associated with the relevant
predictors (21), (27). In such cases, the estimates of the
regression coefficients have large variances resulting in
instability of the hyperplane defined by the regression
equation. This is easy to visualize in the case of two
highly correlated explanatory variables. If the data are
nearly collinear (one-dimensional subspace), the regression
plane is "resting on a knife's edge". Any perturbation in
the data can make it tilt heavily. Again, it might be
desirable to use a subset of variables so as to alleviate
the problem, especially if the variables which are causing
the multicollinearity are extraneous anyway.
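
The two-variable case described above can be made concrete with a small sketch. It relies on a standard result that is not stated in this section and is therefore an assumption here: with two predictors, the variance of each slope estimate is multiplied by 1/(1 - r^2) relative to the uncorrelated case, where r is the sample correlation between the predictors. The function name and the data are hypothetical.

```python
def variance_inflation(x1, x2):
    """Return 1 / (1 - r^2) for two predictors, r being their sample correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)
    return 1.0 / (1.0 - r_squared)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Second predictor almost equal to the first: the data nearly lie on a line.
nearly_collinear = [a + d for a, d in zip(x1, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1])]
# Second predictor only weakly associated with the first.
weakly_related = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

inflation_collinear = variance_inflation(x1, nearly_collinear)  # about 309
inflation_weak = variance_inflation(x1, weakly_related)         # about 1.09
print(inflation_collinear, inflation_weak)
```

In the nearly collinear design the slope variances are inflated by a factor of several hundred, which is the numerical counterpart of the plane "resting on a knife's edge".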

3. Another undesirable effect of multicollinearity is
computational instability, resulting in potentially large
roundoff errors (3).

4. Finally, from the purely practical point of view, a
model with many variables may be difficult to interpret,
difficult or costly to maintain, or both. Interpretation of
relations between individual predictor variables or groups
of them and the response variable is often desirable, and
collection of data on certain variables is often difficult,
unreliable or costly.

The reasons mentioned above should suffice in explaining
why the problem of variable selection is real and,
often, of great practical importance. In the next section,
some commonly applied selection techniques are discussed.

Review  of  Selection  Criteria 
For this section and the ones that follow, some new
notation will be needed. The i-th response is estimated by

$$\hat{Y}_i = b_0 + b_1 X_{i1} + \cdots + b_p X_{ip}, \quad i = 1, 2, \ldots, n. \tag{2.4}$$

The i-th residual is defined by $e_i = Y_i - \hat{Y}_i$ and the sum of
squared errors (residual sum of squares) is defined by

$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2. \tag{2.5}$$

In variable selection the possibility of setting some
of the p coefficients equal to zero is considered. This
amounts to selecting a subset of k, say, out of the p variables.
The mean of the n observations on the response variable
is denoted by $\bar{Y}$, the total sum of squares of deviations
from that mean is defined by

$$\mathrm{SSTO} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \tag{2.6}$$

and the regression sum of squares by

$$\mathrm{SSR} = \mathrm{SSTO} - \mathrm{SSE}. \tag{2.7}$$
A selection criterion is a rule according to which a
certain model out of the $2^p$ possible models is labeled
"best". It should be noted that "best" is defined only in
the sense of the particular criterion employed, and it does
not necessarily imply that that model is best in terms of
its intended use or in terms of how well the relation
defined by it carries over to the population. The position
taken in this dissertation concerning variable selection and
model building is more general, namely, that "selection"
rules ought to be used in order to screen the $2^p$ models down
to a more manageable number, say half a dozen or so, which,
subsequently, would be carefully scrutinized for adequacy
and reasonableness. There are several criteria currently
used for this purpose. The most common ones, as well as
some which are related to the problem of prediction, are
discussed next.

1. The $R^2$ Criterion

The coefficient of determination is defined by

$$R^2 = 1 - \mathrm{SSE}/\mathrm{SSTO}. \tag{2.8}$$

It is clear that $R^2$ is the proportion of variability in Y
which is explained by the variables in the model under
consideration. It seems desirable that, other things being
equal, $R^2$ should be as large as possible. However, since
SSE can not increase as variables are introduced into the
model, $R^2$ will always achieve its maximum when all p
variables are used. If $R^2$ is to be used as a selection
criterion, some subjective rule must be employed that will
determine when the largest increase possible in $R^2$ attained
by the introduction of a new variable does not compensate
for the loss in degrees of freedom due to estimating an
additional parameter. A graph of $R^2$ versus model size is
usually helpful in devising what is called an "elbow rule".
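
The claim that SSE cannot increase, and hence that $R^2$ cannot decrease, as variables enter the model can be checked numerically. The following is a minimal sketch on made-up data; the helper names `solve`, `fit` and `sse` and all numbers are hypothetical. It fits the nested models that use the first k of three predictors, k = 0, 1, 2, 3, and lists SSE and $R^2$ for each.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, y):
    """Least squares coefficients [b0, b1, ..., bk] from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    cols = range(len(X[0]))
    xtx = [[sum(x[i] * x[j] for x in X) for j in cols] for i in cols]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in cols]
    return solve(xtx, xty)

def sse(rows, y):
    """Residual sum of squares (2.5) of the least squares fit."""
    b = fit(rows, y)
    fitted = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Hypothetical data: three candidate predictors observed on n = 8 units.
data = [(1.0, 2.0, 1.0), (2.0, 1.0, 0.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0),
        (5.0, 6.0, 1.0), (6.0, 5.0, 0.0), (7.0, 8.0, 1.0), (8.0, 7.0, 0.0)]
Y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.7, 15.3, 16.9]

# Nested models using the first k predictors, k = 0, 1, 2, 3.
sse_by_k = [sse([r[:k] for r in data], Y) for k in range(4)]
ssto = sse_by_k[0]  # intercept-only fit: residuals are deviations from the mean
r2_by_k = [1.0 - s / ssto for s in sse_by_k]
print(sse_by_k)
print(r2_by_k)
```

Plotting `r2_by_k` against k gives the kind of graph from which an "elbow rule" would be read.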

2. The Adjusted $R^2$ Criterion
(Mean Square Error)

To overcome the subjectivity involved in using $R^2$, an
adjustment for degrees of freedom can be made by defining
the adjusted coefficient of determination, $R_a^2$, by

$$R_a^2 = 1 - [\mathrm{SSE}/(n-k-1)]/[\mathrm{SSTO}/(n-1)] \tag{2.9}$$

where k is the number of predictor variables in the postulated
model. This statistic usually achieves a maximum with
a model containing fewer than p variables. The equation

$$R_a^2 = R^2 - k(1-R^2)/(n-k-1) \tag{2.10}$$

shows the relationship between the statistics $R^2$ and $R_a^2$.
This criterion is equivalent to selecting the model with
smallest mean square error, defined by MSE = SSE/(n-k-1),
since the denominator in $R_a^2$ does not change with the
variables selected. A preference for choosing models with large
$R_a^2$ is based on the fact that the "true" model minimizes the
expected MSE (32). This criterion is most often referred to
as the "minimum Mean Square Error" criterion, and this name
will be used in what follows.
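
A short numerical sketch of (2.8) through (2.10) follows. The sums of squares are invented for illustration, and the function names are hypothetical. The example is chosen so that adding a fourth variable raises $R^2$ slightly while lowering $R_a^2$, that is, raising the MSE.

```python
def r2(sse, ssto):
    """Coefficient of determination, equation (2.8)."""
    return 1.0 - sse / ssto

def adjusted_r2(sse, ssto, n, k):
    """Adjusted coefficient of determination, equation (2.9)."""
    return 1.0 - (sse / (n - k - 1)) / (ssto / (n - 1))

def mse(sse, n, k):
    """Mean square error of a model with k predictor variables."""
    return sse / (n - k - 1)

n, ssto = 20, 100.0
# Hypothetical models: (number of variables k, SSE).
model_a = (3, 30.0)
model_b = (4, 29.5)   # one more variable, marginally smaller SSE

for k, sse in (model_a, model_b):
    plain, adj = r2(sse, ssto), adjusted_r2(sse, ssto, n, k)
    via_2_10 = plain - k * (1.0 - plain) / (n - k - 1)  # right-hand side of (2.10)
    print(k, plain, adj, via_2_10, mse(sse, n, k))
```

For each model the last two adjusted values printed agree, which is equation (2.10); the ranking by $R_a^2$ is the reverse of the ranking by MSE, as the text states.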

3. The Maximum F Criterion

Sometimes it is deemed desirable to maximize the ratio

$$F = [(\mathrm{SSTO}-\mathrm{SSE})/k]/[\mathrm{SSE}/(n-k-1)]. \tag{2.11}$$

The numerator in the expression above is referred to as the
regression mean square. This criterion is used less
frequently than the others, and it is very parsimonious. That
is to say, it tends to select models with very few variables.

4. Mallows' Criterion
Mallows (25) introduced the statistic

$$C_k = \mathrm{SSE}/\hat{\sigma}^2 + (2k-n) \tag{2.12}$$

where $\hat{\sigma}^2$ is an estimate of $\sigma^2$. $C_k$ is an estimate of the
standardized total squared error of predicting at the points
in the data base (19). A model with small bias is expected
to yield a $C_k$ statistic about equal to the number of variables,
k, associated with it as can easily be shown. In this
investigation, the model with smallest $C_k$ will be referred
to as the "minimum $C_k$" model and will be used for comparison
purposes. Usually, the MSE of the model containing all
variables under consideration is used for $\hat{\sigma}^2$, although this
forces $C_p$ to be equal to p. Easily interpretable plots of
$C_k$ versus k can be drawn and it is suggested that models
with small $C_k$ be considered. The $C_k$ statistic and its
properties have been discussed by Daniel and Wood (9), Gorman
and Toman (15), Hocking (19), Mallows (25), (26) and others.

All of the criteria discussed above share two properties.

(a) They are all simple functions of SSE and, thus, for any
fixed number of variables, they all select the same model,
namely the one which minimizes SSE.

(b) If the final model is to be used in order to predict
the response y at a known point x in the space of the
predictor variables, they all ignore its location and its
characteristics with respect to the historical data. As
Wallenius (34) pointed out, "...the first of these properties
is reasonable but myopic when the object is prediction.
The second one seems contrary to all reason."

Of the four criteria, Mallows' technique is the one most
directly related to the problem of prediction in view of the
fact that it utilizes the total squared error of prediction.
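
Property (a) can be illustrated with a minimal sketch that evaluates (2.8), (2.9), (2.11) and (2.12) for several candidate subsets of the same size. The subset labels, the SSE values and the value used for the variance estimate are all hypothetical. Because each statistic is monotone in SSE when k is fixed, every criterion picks the same subset.

```python
def criteria(sse, ssto, n, k, sigma2_hat):
    """Return (R^2, adjusted R^2, F, C_k) as in (2.8), (2.9), (2.11) and (2.12)."""
    r2 = 1.0 - sse / ssto
    adj = 1.0 - (sse / (n - k - 1)) / (ssto / (n - 1))
    f = ((ssto - sse) / k) / (sse / (n - k - 1))
    c_k = sse / sigma2_hat + (2 * k - n)
    return r2, adj, f, c_k

# Hypothetical SSE values for four candidate subsets, all of size k = 2.
candidates = {"X1,X2": 41.0, "X1,X3": 35.5, "X2,X4": 52.3, "X3,X4": 38.9}
n, k, ssto, sigma2_hat = 25, 2, 120.0, 1.6

scores = {name: criteria(s, ssto, n, k, sigma2_hat) for name, s in candidates.items()}

best_by_sse = min(candidates, key=candidates.get)
best_by_r2 = max(scores, key=lambda name: scores[name][0])
best_by_adj = max(scores, key=lambda name: scores[name][1])
best_by_f = max(scores, key=lambda name: scores[name][2])
best_by_ck = min(scores, key=lambda name: scores[name][3])
print(best_by_sse, best_by_r2, best_by_adj, best_by_f, best_by_ck)
```

The criteria can disagree only when models of different sizes are compared, since they penalize the number of variables differently.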

5.  The  Prediction  Sum  of  Squares 
Criterion 

David M. Allen (2) suggested the following selection
procedure:

Let $\hat{Y}_i^{(i)}$, $i = 1, 2, \ldots, n$ denote the i-th predicted response,
when a given model is used, and with the i-th observation
removed from the data base, so that the coefficients are
derived from the least squares calculations based only on
the remaining n-1 observations. For each model, compute the
prediction sum of squares (PRESS) statistic, given by

$$\mathrm{PRESS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i^{(i)})^2. \tag{2.13}$$


Consider models with small PRESS. Notice that PRESS is an
indication of the predictive performance of a model over the
points in the data base. Intuitively, a model with small
PRESS should be expected to do better in predicting future
observations than a model with large PRESS. However, since
this technique fails to take into account the values of the
variables at the point under prediction, it is conceivable
that there can be points, both in the region of the historical
data and outside, where the selected model may be
inappropriate.

In  terms  of  the  computational  aspect  of  the  problem, 
this  method  is  much  more  demanding  than  the  four  previously 
mentioned. 
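
The PRESS computation in (2.13) can be sketched as follows, on made-up data with hypothetical helper names. The first function follows the definition literally, refitting the model n times. The second uses a standard identity that is not stated in this section and is an assumption here: the deleted residual equals $e_i/(1-h_{ii})$, where $h_{ii} = x_i'(X'X)^{-1}x_i$ is computed from the single fit on all n observations.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def design(rows):
    """Design matrix with a leading column of 1's, and its cross-product X'X."""
    X = [[1.0] + list(r) for r in rows]
    cols = range(len(X[0]))
    return X, [[sum(x[i] * x[j] for x in X) for j in cols] for i in cols]

def fit(rows, y):
    """Least squares coefficients from the normal equations."""
    X, xtx = design(rows)
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(len(X[0]))]
    return solve(xtx, xty)

def press_by_deletion(rows, y):
    """PRESS from the definition: refit with observation i removed, then predict it."""
    total = 0.0
    for i in range(len(y)):
        b = fit(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        pred = b[0] + sum(bj * xj for bj, xj in zip(b[1:], rows[i]))
        total += (y[i] - pred) ** 2
    return total

def press_by_leverage(rows, y):
    """PRESS from one fit, using deleted residual = e_i / (1 - h_ii)."""
    X, xtx = design(rows)
    b = fit(rows, y)
    total = 0.0
    for x, yi in zip(X, y):
        resid = yi - sum(bj * xj for bj, xj in zip(b, x))
        h = sum(xj * vj for xj, vj in zip(x, solve(xtx, x)))  # x_i'(X'X)^{-1} x_i
        total += (resid / (1.0 - h)) ** 2
    return total

# Hypothetical data: two predictors observed on n = 8 units.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0),
        (5.0, 6.0), (6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]
Y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.7, 15.3, 16.9]

press_refit = press_by_deletion(rows, Y)
press_short = press_by_leverage(rows, Y)
print(press_refit, press_short)
```

The two values agree up to rounding, so the n refits are not strictly needed for a given model; the cost of screening many candidate models remains.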


6.  Mean  Square  Error  of  Prediction 
The mean square error of prediction (MSEP) of the
response y at a given point x can be expressed as

$$E(y-\hat{y})^2 = \sigma^2 + \mathrm{Var}(\hat{y}) + (\text{bias})^2, \tag{2.14}$$

i.e.

$$E(y-\hat{y})^2 = \sigma^2 + \mathrm{Var}(\hat{y}) + [E(y) - E(\hat{y})]^2. \tag{2.15}$$
Allen (1) proposed choosing subsets of variables which
minimize an estimate of the mean square error of prediction.
This is a difficult task to accomplish successfully,
mainly because of the fact that a good estimate of the bias
term assumes knowledge of E(y) or, at least, a very good
estimate of it. However, this is at the very heart of the
problem of prediction which the analysis attempts to solve.
An assumption about good knowledge of E(y) seems to create a
logical vicious circle in that the unknown answer to the
problem is somehow used in order to get to it. Allen's
approach to this is to assume that the full model contains
variables which were chosen carefully, so as to include all
relevant ones and exclude all unnecessary ones. As a result
of this, the full model will be unbiased, while any submodel
will be biased to a measurable degree. The bias associated
with a given submodel is then estimated by the difference in
point predictions between the full model and the submodel,
an estimate of $\sigma^2$ based on the sum of squared errors of the
full model is used in the expression for $E(y-\hat{y})^2$ and the
submodel is found preferable to the full model if and only
if the reduction in prediction variance is greater than the
square of the bias.
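
The comparison just described can be sketched as follows. This is an illustration of the rule as summarized in the paragraph above, not Allen's published procedure. It assumes the standard expression $\mathrm{Var}(\hat{y}) = \sigma^2 x'(X'X)^{-1}x$ for the prediction variance, which is not given in this section; the helper names, the data and the point x0 are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def cross_product(rows):
    """Design matrix with a leading column of 1's, and X'X."""
    X = [[1.0] + list(r) for r in rows]
    cols = range(len(X[0]))
    return X, [[sum(x[i] * x[j] for x in X) for j in cols] for i in cols]

def fit(rows, y):
    X, xtx = cross_product(rows)
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(len(X[0]))]
    return solve(xtx, xty)

def predict(b, x):
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

def quadratic_form(rows, x0):
    """x0'(X'X)^{-1} x0 with the intercept included (assumed variance factor)."""
    _, xtx = cross_product(rows)
    v = [1.0] + list(x0)
    return sum(a * c for a, c in zip(v, solve(xtx, v)))

def compare_with_full(rows, y, x0, keep):
    """Return (prefer_submodel, variance_reduction, squared_bias_estimate) at x0."""
    n, p = len(y), len(rows[0])
    sub_rows = [tuple(r[j] for j in keep) for r in rows]
    sub_x0 = tuple(x0[j] for j in keep)
    b_full, b_sub = fit(rows, y), fit(sub_rows, y)
    sse_full = sum((yi - predict(b_full, r)) ** 2 for yi, r in zip(y, rows))
    sigma2_hat = sse_full / (n - p - 1)           # from the full model
    reduction = sigma2_hat * (quadratic_form(rows, x0) - quadratic_form(sub_rows, sub_x0))
    bias_sq = (predict(b_full, x0) - predict(b_sub, sub_x0)) ** 2
    return reduction > bias_sq, reduction, bias_sq

# Hypothetical data: three predictors on n = 8 units, and a point under prediction.
data = [(1.0, 2.0, 1.0), (2.0, 1.0, 0.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0),
        (5.0, 6.0, 1.0), (6.0, 5.0, 0.0), (7.0, 8.0, 1.0), (8.0, 7.0, 0.0)]
Y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.7, 15.3, 16.9]
x0 = (4.5, 5.0, 1.0)

prefer, reduction, bias_sq = compare_with_full(data, Y, x0, keep=(0,))
print(prefer, reduction, bias_sq)
```

Note that the sketch never looks at the SSE of the submodel itself, only at that of the full model, which mirrors the description above.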

Even with the assumptions mentioned above, the method
of estimating the bias (a difference in expectations) by
means of a difference in two point predictions may result in
treating different submodels unfairly. The degree of this
unfairness will depend on the difference between E(y) and
the point prediction obtained from the full model. It
should also be observed that the MSE's of the postulated
submodels are not taken into consideration. As a result,
the selected model may provide a very poor fit to the data.
As will be discussed in Chapters III and VI, this seems to
be frequently the case in practice. With regard to the
computational aspect, Allen proposed a sequential procedure
which provides no guarantee that an absolute minimum will be
obtained, either overall or for a fixed subset size k. It
seems that, for such a guarantee, a complete search of all
$2^p-1$ regressions might be necessary. However, the algorithm
which will be developed in Chapter IV may be modified so as
to apply to the MSEP criterion.

In the next chapter, a somewhat different approach is
taken to the problem described above. The position taken in
this dissertation with respect to bias is the following:
When the true population model is known, the bias at x
associated with other models can be obtained. In empirical
work, however, where the true model is not only unknown but
its very notion is not easily or well defined, that bias
cannot be objectively measured. Indirect methods may be
employed that can, hopefully, give indications of its
magnitude. It is believed, however, that using such estimates
directly in the screening of variables is risky at best and
rather inappropriate. Thus, no explicit attempt is made to
estimate bias during the selection. Implicitly, bias is
hoped to be reflected in the size of MSE, which will be used
as an estimate of $\sigma^2$.


CHAPTER  III 


ON  THE  WIDTH  OF  THE  PREDICTION  INTERVAL 

Mahalanobis Distance as a
Measure of Analogy

The Mahalanobis distance, introduced by P. C.
Mahalanobis (24) as a measure of the distance between two
multivariate populations, is a fundamental notion in
multivariate statistics. In this section, the Mahalanobis
distance is discussed in relation to the problems of
prediction and variable selection. The insight gained
through this investigation will be used to explain certain
observations which were made during the course of this
study and to derive computational algorithms for
implementation of the methodology developed.

Suppose there are n historical observations on p
potential predictor variables, $X_{i1}, X_{i2}, \dots, X_{ip}$,
$i = 1, 2, \dots, n$. Let Y denote the $n \times 1$ vector
consisting of the observed values of the response variable
associated with these n observations, and let $X_j$ denote the
$n \times 1$ vector of observed values on variable $X_j$, i.e.,
$Y = [Y_1, Y_2, \dots, Y_n]'$ and
$X_j = [X_{1j}, X_{2j}, \dots, X_{nj}]'$. Let X denote the
$n \times p$ matrix whose columns are $X_j$, $j = 1, \dots, p$,
and let $\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$ be the
sample mean of variable $X_j$. The point
$\bar{X} = [\bar{X}_1, \bar{X}_2, \dots, \bar{X}_p]$ is defined
as the centroid of the data. Finally, let S denote the sample
covariance matrix of variables $X_1, \dots, X_p$, i.e.,

$$s_{ij} = \frac{1}{n-1}\sum_{k=1}^{n}(X_{ki}-\bar{X}_i)(X_{kj}-\bar{X}_j).$$

Denoting the values of the point under prediction by lower
case letters, the response y at the point x must be
predicted by exploiting the predictive relationship between Y
and the characteristics $X_1, \dots, X_p$, and the degree of
analogy between x and X. In general geometric terms,
"degree of analogy" refers to the position of x relative
to X in p-dimensional space. If x is far removed from the
historical data, extrapolation is necessary with all the
attendant risks. This point will be discussed in more
detail in the next section. The issue of how to detect
such extrapolation is considered first.

The standard Euclidean distance may fail to reveal
the degree of extrapolation, due to the intercorrelations
among the variables. Points at a small Euclidean distance
from the centroid $\bar{X}$ of the data may be very
non-analogous in that their coordinates do not conform with
the correlation structure observed in the historical data.
This point can be illustrated in two dimensions. In Figure 1
below, the point x is at a rather small Euclidean distance
from the centroid $\bar{X}$. Nevertheless, it is well outside
the bulk of the data, because its coordinates do not conform
with the negative correlation observed.

Figure 1. Extrapolation in Two Dimensions.


Observe also that the extrapolation in the two-dimensional
scatter would not have been revealed by simple marginal
comparisons. Each coordinate of x is well inside the
range of the data along the corresponding dimension. Of
course, in two or three dimensions, scatter diagrams can be
plotted, which will reveal this phenomenon. In higher
dimensions, a different method becomes necessary.
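The situation described for two dimensions can be reproduced with a few lines of code. The data below are hypothetical (nine points with a strong negative correlation, chosen only for illustration), and the quadratic form used as the "different method" is the sample Mahalanobis distance with the usual n - 1 divisor in the covariance matrix.

```python
import numpy as np

# Hypothetical data with a strong negative correlation between two predictors.
data = np.array([
    [1.0, 9.2], [2.0, 7.9], [3.0, 7.1], [4.0, 5.8], [5.0, 5.0],
    [6.0, 4.2], [7.0, 2.9], [8.0, 2.1], [9.0, 0.8],
])
x = np.array([7.0, 7.0])   # the point under prediction

centroid = data.mean(axis=0)
S = np.cov(data, rowvar=False)          # sample covariance, divisor n - 1

def mahalanobis(point):
    """Quadratic form (point - centroid) S^{-1} (point - centroid)'."""
    d = point - centroid
    return float(d @ np.linalg.solve(S, d))

# Marginal check: is every coordinate of x inside the observed range?
inside_ranges = bool(np.all((data.min(axis=0) <= x) & (x <= data.max(axis=0))))
euclid_x = float(np.linalg.norm(x - centroid))
euclid_data = np.linalg.norm(data - centroid, axis=1)
m_x = mahalanobis(x)
m_data = np.array([mahalanobis(row) for row in data])

print("inside marginal ranges:", inside_ranges)
print("Euclidean distance of x:", euclid_x, " farthest data point:", euclid_data.max())
print("Mahalanobis distance of x:", m_x, " largest among data:", m_data.max())
```

In this sketch x passes both the marginal range check and the Euclidean comparison, yet its Mahalanobis distance is far larger than that of any historical observation.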

The Mahalanobis distance defined by

$$D = (\mu_1-\mu_2)\,\Sigma^{-1}(\mu_1-\mu_2)' \tag{3.1}$$

is a measure of the distance between two multivariate
populations with row vector means $\mu_1$ and $\mu_2$ and common
positive definite covariance matrix $\Sigma$. The degree of
analogy (distance) between x and the data X can be described by
means of the sample counterpart of the above measure, namely

$$M = (x-\bar{X})\,S^{-1}(x-\bar{X})'. \tag{3.2}$$

The measure M will be referred to as the Mahalanobis
distance between x and X. Observe that, except for a
multiplicative constant, this is Hotelling's $T^2$ statistic
used to test the hypothesis that x and the historical data come
from the same multinormal population, assuming equal covariances.
In the univariate case, M is (a multiple of) a squared
t-ratio. In the multivariate case as well, M can be viewed
as the square of a t-ratio. It is the squared t-ratio of
that linear combination of the variables which produces
the largest t-ratio. Each univariate t-ratio corresponds
to one such linear combination. However, as mentioned
earlier, marginal, univariate comparisons may fail to
reveal the degree of extrapolation. All univariate t-ratios
may be small, although the multivariate Mahalanobis
distance may be arbitrarily large. For instance, with only
two variables, M can be expressed as

$$M = \frac{M_1 - 2r(M_1M_2)^{1/2} + M_2}{1-r^2} \tag{3.3}$$

where $M_1$ and $M_2$ are the corresponding univariate measures,
and r is the sample correlation coefficient between the two
variables. It is clear that, even if both $M_1$ and $M_2$ are
small, M can still be large. For example, if
$M_1 = M_2 = \epsilon$, then $M = 2\epsilon/(1+r)$, which can
become arbitrarily large as $r \to -1$.
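A short numerical check of the two-variable expression may help. The sketch below assumes, as the example in the text implicitly does, that the two standardized deviations have the same sign, so that their product equals the positive square root of the product of the univariate measures; the function names are illustrative.

```python
import numpy as np

def m_two_variables(m1, m2, r):
    """Equation (3.3); assumes both standardized deviations have the same sign."""
    return (m1 - 2.0 * r * np.sqrt(m1 * m2) + m2) / (1.0 - r ** 2)

def m_quadratic_form(m1, m2, r):
    """The same quantity from the quadratic form, using the correlation matrix."""
    z = np.array([np.sqrt(m1), np.sqrt(m2)])      # standardized deviations
    R = np.array([[1.0, r], [r, 1.0]])
    return float(z @ np.linalg.solve(R, z))

eps = 0.25   # both univariate measures are small
for r in (0.0, -0.9, -0.99):
    print(r, m_two_variables(eps, eps, r), m_quadratic_form(eps, eps, r))
```

As r approaches -1 the printed values grow like 2 eps / (1 + r), even though each univariate measure stays fixed at eps.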

A few things might be of interest to note about the
Mahalanobis distance. Points equidistant from the centroid
$\bar{X}$ form ellipsoids with center at $\bar{X}$ whose axes
coincide with the principal components axes of the data. It is
a distance measure, that is, it is non-negative and symmetric,
and its square root satisfies the triangle inequality. If
S = I, the Mahalanobis distance becomes the natural (squared)
Euclidean distance in p-space. For an arbitrary positive
definite S, the Mahalanobis distance is equivalent to the
(squared) Euclidean distance in the "Mahalanobis space"
$S^{-1/2}x'$, where $S^{1/2}$ is the symmetric square root of S.

With respect to this investigation, the behavior of
the Mahalanobis distance as variables enter or leave the
regression equation is of interest. The monotonicity
theorem which follows expresses the change in the Mahalanobis
distance, as a new variable is introduced, in terms of easily
recognizable regression statistics and provides insight which
will aid in subsequent analysis.


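Theorem 3.1: Let $M_k$ denote the Mahalanobis distance
between the vectors $\mathbf{x}_k = [x_1, \dots, x_k]$ and
$\bar{\mathbf{X}}_k = [\bar{X}_1, \dots, \bar{X}_k]$. Let $S_{11}$
denote the covariance matrix of variables $X_1, \dots, X_k$. Let
$M_{k+1}$ denote the Mahalanobis distance between
$\mathbf{x}_{k+1} = [\mathbf{x}_k, x_{k+1}]$ and
$\bar{\mathbf{X}}_{k+1} = [\bar{\mathbf{X}}_k, \bar{X}_{k+1}]$,
and let $\Delta M = M_{k+1} - M_k$. Partition the covariance
matrix of $X_1, \dots, X_k, X_{k+1}$ as

$$S = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}.$$

Then,

$$\Delta M = \frac{(x_{k+1}-\hat{x}_{k+1})^2}{S_{22}(1-r^2)}, \tag{3.4}$$

where

$$\hat{x}_{k+1} = \bar{X}_{k+1} + S_{21}S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)',$$

and r is the multiple correlation coefficient between
variables $X_{k+1}$ and $X_1, \dots, X_k$.

Proof: From a well known matrix identity (16),

$$S^{-1} = \begin{bmatrix}
[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1} & -S_{11}^{-1}S_{12}[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1} \\
-S_{22}^{-1}S_{21}[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1} & [S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}
\end{bmatrix}$$

and

$$\{-S_{22}^{-1}S_{21}[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1}\}' = -S_{11}^{-1}S_{12}[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}.$$

Therefore,

$$\begin{aligned}
\Delta M ={}& (\mathbf{x}_k-\bar{\mathbf{X}}_k)[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&- (x_{k+1}-\bar{X}_{k+1})S_{22}^{-1}S_{21}[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&- (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}S_{12}[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}(x_{k+1}-\bar{X}_{k+1}) \\
&+ (x_{k+1}-\bar{X}_{k+1})[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}(x_{k+1}-\bar{X}_{k+1}) \\
&- (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
={}& (\mathbf{x}_k-\bar{\mathbf{X}}_k)[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&- 2(x_{k+1}-\bar{X}_{k+1})[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}S_{21}S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&+ (x_{k+1}-\bar{X}_{k+1})^2[S_{22}-S_{21}S_{11}^{-1}S_{12}]^{-1}
 - (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)'.
\end{aligned} \tag{3.5}$$

Notice that the correlation coefficient, r, can be written
as

$$r = (S_{21}S_{11}^{-1}S_{12}/S_{22})^{1/2}, \quad \text{with } 0 \le r^2 < 1.$$

Therefore, relation (3.5) above becomes

$$\begin{aligned}
\Delta M ={}& (\mathbf{x}_k-\bar{\mathbf{X}}_k)[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&- 2(x_{k+1}-\bar{X}_{k+1})[S_{22}(1-r^2)]^{-1}S_{21}S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&+ (x_{k+1}-\bar{X}_{k+1})^2[S_{22}(1-r^2)]^{-1}
 - (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)'.
\end{aligned} \tag{3.6}$$

Using the fact (see (30)) that

$$[A+UV']^{-1} = A^{-1} - \frac{A^{-1}UV'A^{-1}}{1+V'A^{-1}U}$$

with $U = -S_{12}S_{22}^{-1}$, $V = S_{12}$ and $A = S_{11}$,
it follows that

$$[S_{11}-S_{12}S_{22}^{-1}S_{21}]^{-1} = S_{11}^{-1} + \frac{S_{11}^{-1}S_{12}S_{21}S_{11}^{-1}}{S_{22}-S_{21}S_{11}^{-1}S_{12}}.$$

Thus, relation (3.6) becomes

$$\begin{aligned}
\Delta M ={}& (\mathbf{x}_k-\bar{\mathbf{X}}_k)(S_{11}^{-1}S_{12}S_{21}S_{11}^{-1})[S_{22}(1-r^2)]^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&- 2(x_{k+1}-\bar{X}_{k+1})[S_{22}(1-r^2)]^{-1}S_{21}S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
&+ (x_{k+1}-\bar{X}_{k+1})^2[S_{22}(1-r^2)]^{-1} \\
&+ (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)'
 - (\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)' \\
={}& \frac{\{x_{k+1} - [\bar{X}_{k+1} + S_{21}S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)']\}^2}{S_{22}(1-r^2)}
 = \frac{(x_{k+1}-\hat{x}_{k+1})^2}{S_{22}(1-r^2)}.
\end{aligned}$$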
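The identity of Theorem 3.1 can be checked numerically. The sketch below uses simulated data (hypothetical, seeded for reproducibility) and compares the directly computed difference of the two Mahalanobis distances with the right-hand side of (3.4); the function names are illustrative and the last column of the data array plays the role of the newly introduced variable.

```python
import numpy as np

def mahalanobis(x, data):
    """Quadratic form M between the point x and the rows of data."""
    d = x - data.mean(axis=0)
    S = np.atleast_2d(np.cov(data, rowvar=False))
    return float(d @ np.linalg.solve(S, d))

def delta_m(x, data):
    """Right-hand side of (3.4) when the last column of data is the new variable."""
    k = data.shape[1] - 1
    center = data.mean(axis=0)
    S = np.cov(data, rowvar=False)
    S11, S12, S22 = S[:k, :k], S[:k, k], S[k, k]
    coef = np.linalg.solve(S11, S12)                  # S11^{-1} S12
    x_hat = center[k] + coef @ (x[:k] - center[:k])   # regression prediction of x_{k+1}
    r2 = (S12 @ coef) / S22                           # squared multiple correlation
    return float((x[k] - x_hat) ** 2 / (S22 * (1.0 - r2)))

rng = np.random.default_rng(1)
data = rng.normal(size=(20, 4))
data[:, 3] += 0.8 * data[:, 0]      # make the new variable correlated with an old one
x = rng.normal(size=4)

lhs = mahalanobis(x, data) - mahalanobis(x[:3], data[:, :3])
rhs = delta_m(x, data)
print("M_{k+1} - M_k =", lhs, "  formula (3.4) =", rhs)
```

The two printed numbers agree up to rounding error, and the common value is non-negative, as the theorem requires.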


Observe, first, that $\Delta M \ge 0$. Thus, as variables are
introduced into the regression, the Mahalanobis distance
cannot decrease. The resulting increase is equal to the
standardized square error of predicting $x_{k+1}$ from the
regression of the newly introduced variable $X_{k+1}$ on the
variables $X_1, X_2, \dots, X_k$ which were already in the
regression equation. The standardization is done by dividing the
resulting squared error by the conditional variance of
variable $X_{k+1}$ (conditioned on variables
$X_1, X_2, \dots, X_k$). This standardization implies that the
expected change in M should not depend on the strength of the
relationship between $X_{k+1}$ and $X_1, X_2, \dots, X_k$.
Obviously, $\Delta M = 0$ if and only if $x_{k+1}$ is on the
hyperplane defined by the regression mentioned above. The
Mahalanobis distance is a unitless quantity whose size does not
depend on the units of the particular problem. For a given
problem, x and X are, of course, fixed. Over all x and X,
however, drawn from a k-variate normal population, the quantity

$$\frac{n}{n+1}M_k = \frac{n}{n+1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)S_{11}^{-1}(\mathbf{x}_k-\bar{\mathbf{X}}_k)'$$

is distributed like Hotelling's two-sample $T^2$, and, so, it
has the distribution of

$$\frac{k(n-1)}{n-k}F_{k,n-k},$$

where $F_{k,n-k}$ is an F variate with k and n-k degrees of
freedom. This distributional property of M implies that
its expected magnitude depends only on n and the number of
variables, k, and is independent of the specific variables
involved. Thus, the relative magnitude of the realized M
for a particular subset of variables can be assessed.
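As a small worked consequence of the distributional statement above, the expected value of M can be written down explicitly. The sketch below assumes that x and the n historical observations are independent draws from the same k-variate normal population and uses the standard mean of an F variate, $E[F_{a,b}] = b/(b-2)$ for $b > 2$; the closed forms are derived here for illustration and are not quoted from the source.

```latex
\begin{align*}
E\!\left[\tfrac{n}{n+1}M_k\right]
  &= \frac{k(n-1)}{n-k}\,E[F_{k,n-k}]
   = \frac{k(n-1)}{n-k}\cdot\frac{n-k}{n-k-2}
   = \frac{k(n-1)}{n-k-2}, \qquad n-k>2,\\[4pt]
E[M_k] &= \frac{k\,(n^2-1)}{n\,(n-k-2)},\\[4pt]
E[\Delta M] &= E[M_{k+1}]-E[M_k]
   = \frac{(n^2-1)(n-2)}{n\,(n-k-2)(n-k-3)}, \qquad n-k>3 .
\end{align*}
```

Both expressions involve only n and k, so a realized M (or a realized increase in M) for a particular subset can be judged against a benchmark that does not depend on which variables the subset contains.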

The  result  of  this  theorem  will  be  used  in  what  follows 
and,  in  particular,  in  Chapter  IV  where  the  computational 
aspect  of  the  problem  is  investigated. 

The  Prediction  Interval 

As was mentioned earlier, all standard variable selection
techniques share the property that, for any given number
of variables in the regression equation, the optimal
set is the one which minimizes the sum of squared residuals
or, equivalently, maximizes $R^2$. The point under prediction
may be rather non-analogous to the historical data (large M)
when we consider the set of variables identified as "optimal"
by the criterion used. In such cases, the model will
be required to extrapolate. The term "extrapolation" is
used here in the sense that the variance of prediction

$$\sigma^2\,x(X'X)^{-1}x' = \sigma^2\left[\frac{1}{n}+\frac{M}{n-1}\right] \tag{3.10}$$

is large, relative to the inherent error variability $\sigma^2$.
Extrapolation should be avoided whenever possible for two
reasons:

1. The hyperplane defined by the regression equation
may fit the available data rather well, but this may be
true only in the region of the X-space in which data are
available. The true model for the full X-space may well
be quite different in the vicinity of x, thus producing
substantial bias.

2. Even if the variables used are the ones generating
the response values Y, the variance of prediction and, as
a consequence, the errors at points removed from the bulk
of the data may be large due to the variances associated
with the estimates $b_1, b_2, \dots, b_k$. These variances may
be large, compared with $\sigma^2$, especially in the presence
of multicollinearity among the retained variables.

Ideally, variables which are extraneous to the problem
as well as variables whose presence does not contribute
significantly to the explanation of the variability in Y
should be detected. Dropping such variables from the
regression equation has the effect of reducing the variance
of prediction at x. This, of course, should not be done at
the expense of excluding variables whose inclusion would
greatly enhance the fit of the data as measured by the mean
square error. Often, there are several models which come
close to the "optimum" in terms of $R^2$ and other measures of
model aptness based on residual analysis. In those cases,
by using a slightly sub-optimal set of predictor variables
(slight decrease in $R^2$), it may be possible to substantially
improve the degree of analogy (decrease M) and thus reduce
prediction variance.

To illustrate this point, suppose that two single-variable
models are to be compared in terms of their
expected predictive performance at x = [10, 0]. Suppose,
moreover, that the following statistics are associated with
each model and the corresponding variables:

    Model 1 (variable X_1):  R_1^2 = 0.90,  mean of X_1 = 10,  S_1 = 4,
    Model 2 (variable X_2):  R_2^2 = 0.92,  mean of X_2 = 10,  S_2 = 4.

Suppose further that there are n = 10 observations on $X_1$,
$X_2$ and Y, and that $S_Y = 4$, where $S_1$, $S_2$ and $S_Y$ are
sample standard deviations. If the corresponding mean square
errors, $\mathrm{MSE}_i$, are used to estimate $\sigma^2$, the
prediction variances

$$\mathrm{MSE}_i\,x_i(X_i'X_i)^{-1}x_i', \qquad i = 1, 2,$$

are equal to 0.180 for the first model and 1.144 for the
second one. Notice that the corresponding Mahalanobis
distances are $M_1 = 0$ and $M_2 = 6.25$. Even though the second
variable results in a slightly better fit for the data, the
point x = [10, 0] is so non-analogous on this dimension
that it might be preferable to use variable $X_1$ for
prediction.
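The figures quoted in this illustration can be reproduced directly. The sketch below assumes that the values 4 given for the two predictors and for the response are sample standard deviations (this is the reading under which the stated results are recovered) and that each model contains a constant term plus one variable; the helper names are illustrative.

```python
def mse_from_r2(s_y, r2, n, k):
    """MSE = S_Y^2 (n - 1)(1 - R^2) / (n - k - 1)."""
    return s_y ** 2 * (n - 1) * (1.0 - r2) / (n - k - 1)

def mahalanobis_1d(x, mean, sd):
    """Univariate Mahalanobis distance: squared standardized deviation."""
    return ((x - mean) / sd) ** 2

def prediction_variance(mse, m, n):
    """MSE [1/n + M/(n - 1)], the estimated variance of prediction."""
    return mse * (1.0 / n + m / (n - 1))

n, s_y = 10, 4.0
x = (10.0, 0.0)
models = [
    {"name": "X1", "r2": 0.90, "mean": 10.0, "sd": 4.0, "x": x[0]},
    {"name": "X2", "r2": 0.92, "mean": 10.0, "sd": 4.0, "x": x[1]},
]

results = {}
for mod in models:
    m = mahalanobis_1d(mod["x"], mod["mean"], mod["sd"])
    mse = mse_from_r2(s_y, mod["r2"], n, k=1)
    var = prediction_variance(mse, m, n)
    results[mod["name"]] = (m, mse, var)
    print(mod["name"], "M =", m, "MSE =", round(mse, 3), "variance =", round(var, 3))
```

The better-fitting second model (MSE 1.44 against 1.80) ends up with the larger prediction variance because its Mahalanobis term dominates.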

The width of the $100(1-\alpha)\%$ prediction interval at x
is a numeraire which reflects the situation discussed above
and, thus, it may provide a reasonable basis for choosing
among competing models. For computational simplicity, a
monotone function of the width will be used, namely the
square of the half-width, viz.

$$W = F_{1-\alpha;1,n-k-1}\,\mathrm{MSE}\left[\frac{n+1}{n}+\frac{M}{n-1}\right], \tag{3.11}$$

where $F_{1-\alpha;1,n-k-1}$ is the $(1-\alpha)$-th fractile of an
F distribution with 1 and n-k-1 degrees of freedom. This measure,
W, combines fit (MSE) and degree of analogy (M), with a
factor F which penalizes for using too many variables
(increasing k) or excluding points from the data base
(decreasing n). In this form, the role of analogy as
measured by the Mahalanobis distance becomes evident.
Failure to consider this factor in selecting a set of
predictor variables could have a marked effect on predictor
precision as measured by the width of the prediction interval,
and, as a consequence, on the accuracy of prediction
as measured by the prediction error.
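A minimal sketch of how W might be computed from raw data follows. It is not the algorithm of Chapter IV: the function name, the simulated data and the decision to pass the F fractile in as a plain number are all assumptions made for illustration (a statistical library's F quantile function could supply that value). The predictor matrix is assumed to exclude the constant column, which is added inside the function.

```python
import numpy as np

def w_criterion(X, y, x, f_value):
    """Return (W, MSE, M) for the n x k predictor matrix X (no intercept column)."""
    n, k = X.shape
    d = x - X.mean(axis=0)
    S = np.atleast_2d(np.cov(X, rowvar=False))       # divisor n - 1
    m = float(d @ np.linalg.solve(S, d))             # Mahalanobis distance M
    design = np.column_stack([np.ones(n), X])        # constant term added here
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    mse = float(resid @ resid) / (n - k - 1)
    w = f_value * mse * ((n + 1) / n + m / (n - 1))  # squared half-width
    return w, mse, m

# Hypothetical data: 15 observations on two predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(15, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=15)
x = np.array([0.5, 1.5])
f_value = 4.75   # approximately the 0.95 fractile of F with 1 and 12 d.f.

w, mse, m = w_criterion(X, y, x, f_value)
print("W =", w, " MSE =", mse, " M =", m)
```

Returning MSE and M alongside W keeps the two ingredients visible, which is useful when several candidate subsets with similar W values have to be compared.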

As mentioned in the previous chapter, the position
taken in this dissertation is that bias is not an issue that
can be dealt with directly in unstructured situations with
which empirical model building is concerned, since the
population model (true model) is unknown. Nevertheless, it
might be of interest to note that the mean square error of
prediction of the elusive population model is

$$E(y-\hat{y})^2 = \sigma^2\left[\frac{n+1}{n}+\frac{M}{n-1}\right]. \tag{3.12}$$

Thus, if one were willing to assume that the postulated
empirical model is the same as the population model,
then the last two factors in W would be an estimate of the
mean square error of prediction. Notice that the factor
$F_{1-\alpha;1,n-k-1}$ is only a penalizing factor for lost error
degrees of freedom. For any given set of n observations and
any number of variables, k, the set of predictor variables
which minimizes W, minimizes also this estimate of
$E(y-\hat{y})^2$.

General  Observations  About  W 
The quantity W has been defined as "the squared half-width
of a $100(1-\alpha)\%$ prediction interval at x". Its
statistical validity as a bona fide $100(1-\alpha)\%$ prediction
interval is vitiated if the model is selected by minimizing W,
just as the distributional properties of $R^2$ are no longer
valid when the data are used to build a model which minimizes
MSE (11). After the data have been looked into and, say, the
model with smallest W is selected, the confidence associated
with that interval will be less than $100(1-\alpha)\%$.
Therefore, an interval centered at $\hat{y}$ with width
$2W^{1/2}$ should not be thought of as a $100(1-\alpha)\%$
prediction interval but only as a relative indication of the
predictive performance of the various models. Although no
concrete statements can be made, it is hoped that the
improvement in precision will be accompanied by an improvement
in prediction accuracy. For this reason, the quantity will be
referred to as "W" instead of "prediction interval" in what
follows.

Another important point is that, even when the model is
specified in advance, the validity of the formula used for
the prediction interval rests on the usual assumptions on
the errors as they were stated in the introduction. If
the residuals associated with a particular model indicate
gross violations of those assumptions on the part of the
errors, W becomes a meaningless statistic. Judging the
predictive performance of various models on the basis of
such a statistic would be quite speculative at best.
Therefore, when W is employed in order to select subsets of
variables, it is important that a check on the assumptions
be made. Appropriate transformations on the variables
should be made before judgement on the basis of W is
attempted. This observation is supported by the analyses
on data sets which will be discussed in Chapter VI.

Given that M cannot decrease as variables are added to
a model, W may decrease only if the mean square error
decreases by an amount sufficient to offset the increase in
M and F. Thus, augmenting a k-variable model to reduce W
will always reduce MSE, so that W-optimal models will tend
to contain fewer variables than MSE-optimal models. This,
in itself, is a rather desirable property in view of the
commonly held opinion among statisticians that the minimum
MSE criterion frequently results in considerable overfitting.
Of course, this parsimony of the "minimum W" criterion
is not guaranteed. The selection may take different
paths for the two criteria. The opposite phenomenon was
observed on only two occasions in the data which were
analyzed.
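The condition under which adding a variable lowers W can be made explicit from the definition of W as the product of the F fractile, the MSE and the bracketed analogy factor. The following restatement is a sketch in which $F_k$ abbreviates $F_{1-\alpha;1,n-k-1}$ and $B_k$ the bracketed factor; both abbreviations are introduced here for convenience and do not appear in the source.

```latex
\begin{align*}
B_k &= \frac{n+1}{n} + \frac{M_k}{n-1}, \qquad
W_k = F_k\,\mathrm{MSE}_k\,B_k,\\[4pt]
W_{k+1} < W_k \;&\Longleftrightarrow\;
\frac{\mathrm{MSE}_{k+1}}{\mathrm{MSE}_k} < \frac{F_k\,B_k}{F_{k+1}\,B_{k+1}} \le 1 .
\end{align*}
```

The right-hand ratio cannot exceed one because the Mahalanobis distance does not decrease when a variable is added and the F fractile does not decrease when an error degree of freedom is lost, so a reduction in W requires a proportionally larger reduction in MSE.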

For large n, W will be dominated by the factor MSE,
since M is divided by n-1. This agrees with our intuition
since, for a given M, the point x is more likely to lie inside
the k-dimensional scatter of the data when n is large than
when n is small. Thus, a hyperplane which explains the data
well should be expected to predict the new point well too.
Otherwise, for a given MSE and a given M, extrapolation, as
it was defined earlier, is more extreme with a small n than
with a large one. The increased influence of M as n
decreases, however, may have an adverse effect on the fit. A
variable may be excluded that is found desirable on other
considerations. It may be prudent in such cases to consider
a slightly W-suboptimal model by forcing the desirable
variable into the regression equation. Investigations confirm
that such an occurrence is possible. At the same time, it
was found that a careful analysis will reveal these anomalies.
An investigation into the variables which are excluded by
the minimum W criterion may provide insight into aspects
of the problem which, otherwise, would not have been gained.

The W criterion partitions the p-dimensional X-space
into well defined and clearly bounded regions in which
different models are optimal. For purposes of illustration, a
simple two-variable example was used in order to obtain
insight into the nature of the various regions. The result
is depicted in Figure 2. Six observations on two predictors
$X_1$ and $X_2$ and the response variable were used, marked by
"+". For each one of the four models containing the
constant term, W was expressed as a function of $X_1$ and $X_2$.
The "equi-W" curves (level curves) were computed and drawn.


Figure  2.  W-optimal  Models  in  Two-Dimensional  Space.  An  Illustration 


The  model  producing  the  smallest  W  in  each  of  the  regions 
defined  by  the  level  curves  was  found. 

Notice  that,  as  x  traverses  any  of  the  boundaries,  the 
model  selected  by  W  changes.  As  a  result,  point  predictions 
change  in  a  discontinuous  way  as  a  boundary  is  crossed. 

This  is  a  somewhat  disconcerting  property  of  the  criterion, 
even  though  W  changes  continuously.  As  was  emphasized 
earlier,  however,  W  is  offered  as  a  screening  aid  and  not  as 
a  method  for  determining  one,  and  only  one,  model  to  be 
labeled  "best".  Seen  in  this  light,  for  certain  points  x, 
the  existence  of  more  than  one  model  with  almost  equal  W's 
should indicate the need for further investigation of their
properties.

The interpretation of the situation for points in the
regions where no variable is retained also merits attention.
A point x in such a region is very non-analogous to the
historical data (large M). Yet, if only the constant is
used, as is suggested by W, the point prediction will be
none other than the mean of the historical response values.
The analyst should view this occurrence as a suggestion that
x and X are sufficiently non-analogous so as to vitiate the
entire regression approach to prediction in the situation
at hand, at least if the regression is to be based on the
given body of data and the predictor variables under
consideration. Even though one can obtain a good fit
between Y and X in the historical data, there is no strong
justification in expecting y to be analogous to Y if x and
X were generated by different processes. Thus, a phenomenon
which seemed anomalous at first glance acts as a valuable
warning for the analyst using the W criterion. In fairness
to more standard approaches to prediction, it is acknowledged
that the careful analyst should become aware that something
is amiss upon observing the large prediction interval at x
based on his selected "best fitting" model.

As mentioned earlier, W combines fit with analogousness.
It is often desirable to know the relative sizes of these
two factors for a given model. A graphic display can yield
insight into data and enable the analyst to perceive
patterns in them which might be difficult to perceive from
numerical procedures and tabular displays. In the
situation at hand, such a display of the magnitudes of M, MSE
and W could help the analyst choose from among several
competing models, according to his judgement of the relative
importance of each factor. For each subset size k such a
display can be constructed by first observing that MSE and
W are expressed in squared Y units. In order to get
unitless quantities, note that

$$\mathrm{MSE} = S_Y^2\,(n-1)(1-R^2)/(n-k-1) \tag{3.13}$$

and, therefore,

$$\frac{W(n-k-1)}{F_{1-\alpha;1,n-k-1}\,S_Y^2} = (1-R^2)\left[M+\frac{n^2-1}{n}\right]. \tag{3.14}$$

Taking natural logarithms of both sides of (3.14) and
rearranging terms,

$$\ln\left[M+\frac{n^2-1}{n}\right] = \ln\left[\frac{W(n-k-1)}{F_{1-\alpha;1,n-k-1}\,S_Y^2}\right] - \ln(1-R^2). \tag{3.15}$$

So, for fixed k, points representing models with equal W's
lie on a line with slope -1, on a graph of
$\ln[M + (n^2-1)/n]$ versus $\ln(1-R^2)$. The intercept of such
lines is determined by W. For a given W, models with small MSE
and large M will be located high on the "equi-W" line, while
models with large MSE and small M will be located on its lower
part.
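The coordinates used in such a display, and the line on which a model with a given W must fall, can be computed as in the sketch below. The function names and the numerical inputs are illustrative assumptions (n = 10, k = 1, a response variance of 16 and an F fractile of about 5.32 for 1 and 8 degrees of freedom); W is taken to be the F fractile times MSE times [(n+1)/n + M/(n-1)].

```python
import math

def display_point(m, r2, n):
    """Coordinates (ln(1 - R^2), ln[M + (n^2 - 1)/n]) of a model on the display."""
    return math.log(1.0 - r2), math.log(m + (n * n - 1) / n)

def equi_w_intercept(w, f_value, s_y2, n, k):
    """Intercept ln[W (n - k - 1) / (F S_Y^2)] of the slope -1 line for this W."""
    return math.log(w * (n - k - 1) / (f_value * s_y2))

# Illustrative inputs; f_value approximates the 0.95 fractile of F(1, 8).
n, k, s_y2, f_value = 10, 1, 16.0, 5.32
r2, m = 0.92, 6.25

mse = s_y2 * (n - 1) * (1.0 - r2) / (n - k - 1)
w = f_value * mse * ((n + 1) / n + m / (n - 1))

abscissa, ordinate = display_point(m, r2, n)
intercept = equi_w_intercept(w, f_value, s_y2, n, k)
print("point:", (abscissa, ordinate), " line: ordinate =", intercept, "- abscissa")
```

The plotted point satisfies ordinate = intercept - abscissa, so every model of the same size k with the same W lies on this line of slope -1.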

For a clearer picture of the relative sizes of W across
model sizes, k, these lines may be labeled by W (or the
width of the prediction interval). Since $\ln(1-R^2) \le 0$, it
might be preferable to set the origin at, say, (-5,0), so
as to have most of the points in the first quadrant. Two
examples were used, differing on the total number of
variables involved. For the first example, the data on page 366
in Draper and Smith (10) were used. This data set involves
thirteen observations on four predictor variables. The
response variable measures the heat evolved during the
hardening of cement containing chemical substances which
are measured by the four predictors. The last row was
deleted from the data base and predicted. The resulting
plots for 1, 2 and 3-variable models are shown in Figures 3,
4 and 5 respectively, with the lines labeled by the width
of the prediction interval. Models which were not selected
by any criterion are marked by a "+". For those which were
selected, the legends indicate the corresponding criteria.
(The models marked as FORWARD and BACKWARD will be discussed
in the next chapter).

Figure 3. Graphic Display of Analogy Versus Fit. One-Variable Models

Figure 4. Graphic Display of Analogy Versus Fit. Two-Variable Models

As can be seen from these displays, some of the
two-variable models seem to be pointed out by several criteria.
The model with variables $X_1$ and $X_2$ results in better fit of
the data as its location in the graph indicates. However,
the point under prediction is rather non-analogous to the
data along these variables. Thus, the model with variables
$X_3$ and $X_4$, although it is associated with a larger MSE
(smaller $R^2$), results in a slightly smaller W. These two
models and, perhaps, the one with variables $X_1$ and $X_4$ and
the  one  with  variables  X. ,  X_  and  X.  would  be  the  ones 
passing  the  first  screening  and  scrutinized  further. 

For a second example, the data on page 352 in Draper
and Smith (10) were used. They consist of twenty-five
observations on nine predictor variables. The response
measures the pounds of steam used monthly in a glycerine
producing operation. The eighth row was set aside and
predicted. This row was selected since it has been
discussed in (1). A preliminary investigation suggested that
variables  x^  and  X^  were  not  important  carriers  of  infor¬ 
mation  in  the  sense  that  they  were  not  involved  with  any  of 
the good models and they were the first ones to be
eliminated by a BACKWARD procedure based on minimum reduction
in $R^2$. To reduce computation requirements and clutter in
plots, they were not considered for further study. Figures
6-11 depict the situations. Because of the large number of
models, only two "equi-W" lines were drawn for each graph
for reference purposes across model sizes. They correspond
to the smallest and the median W's and they are labeled by
the width of the prediction interval.

This is a well behaved data set in that the models
suggested by most criteria have the largest $R^2$ of their
respective sizes. Also, as the graphs indicate, the point
under prediction is rather analogous to the data along all
dimensions (variables). The scales on the axes are the same
on all graphs, making clear the general increase in the
Mahalanobis distance as the number of variables increases.

The  model  with  variables  X2,  X 4,  Xg,  Xg,  Xg  and  X1Q,  as 

these  are  labeled  in  Draper  and  Smith,  was  selected  by  the 

minimum  W,  the  minimum  MSE  and  the  minimum  criteria. 

Notice that the model suggested by the Mean Square Error of
Prediction criterion (variables X4 and X7) seems to be
unacceptable on all other counts. It provides an $R^2 = 0.423$,
which is very small compared with those of many other models.
As a result, the prediction interval associated with it is
very wide. This should underscore the fact that variable
selection criteria are not universally applicable and can
often lead to models which a careful analysis may find
unacceptable. Therefore, they should be used with prudence
and be accompanied by a careful analysis, along more than one
line, of the models they suggest. Model building should
not be reduced to a mechanical selection of variables by
any criterion.

[Figure 7. Graphic Display of Analogy Versus Fit. Two-Variable Models.]
[Figure 9. Graphic Display of Analogy Versus Fit. Four-Variable Models.]
[Figure 10. Graphic Display of Analogy Versus Fit. Five-Variable Models.]
[Figure 11. Graphic Display of Analogy Versus Fit. Six-Variable Models.]
[Axis label on these plots: $\ln(1-R^2)$.]

In  this  chapter,  the  W  criterion  was  developed  and 
discussed.  Its  intuitive  appeal  in  the  specific  problem  of 
prediction  at  a  known  point  is  based  on  the  fact  that  it 
has the potential of focusing attention on an aspect of the
problem which could otherwise have been ignored. As is the
case  with  every  data  analytic  technique,  the  art  of  using 
this  criterion  to  advantage  must  be  developed  through 
experience.  In  subsequent  chapters,  this  methodology  will 
be  applied  to  real  data  sets  in  an  effort  to  understand  its 
properties.  This  should  help  in  learning  how  to  exploit 
its  strong  points  and  avoid  its  weaknesses. 


CHAPTER  IV 


COMPUTATION 

A  Branch  and  Bound  Algorithm 
The need for the selection of a subset of variables
becomes more imperative as the number p of potential
predictors becomes large. At the same time, since the
computing time needed for a search of all $2^p - 1$ possible
regressions increases exponentially in p, it is clear that, for
large p, a full search may be well beyond the budget
considerations of the analysis. An algorithm which will
identify the good models without actually performing all $2^p - 1$
regressions is, in such cases, highly desirable. Such
algorithms exist for criteria which are simple functions of
the sum of squared errors. The most efficient of these take
advantage of the fact that the sum of squared errors
associated with a model is a lower bound on the sums of squared
errors of its submodels. In 1974, Furnival and Wilson (12)
suggested a branch-and-bound algorithm whose efficiency is
enhanced by the fact that the search is made by a
simultaneous traversing of two trees, one for bounds and one for
regressions. A semi-SWEEP operator is employed for the
entering or removing of variables and the matrices needed
at each stage are available from previous SWEEPs. This
algorithm is the most efficient one known to date and
problems involving 30 variables are well within its reach.
An attractive feature of this algorithm is that the "best"
m models for each subset size k can be output without great
loss of efficiency.
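
The bounding property used by such algorithms (the sum of squared errors of a model never exceeds that of any of its submodels) can be checked numerically. The following is only a minimal sketch in plain Python, not the Furnival and Wilson implementation: the data are made up, and the helper names `sweep`, `sscp` and `sse_of` are hypothetical.

```python
from itertools import combinations

def sweep(a, k):
    """Return a copy of the square matrix a swept on diagonal element k."""
    d = a[k][k]
    out = [row[:] for row in a]
    out[k] = [v / d for v in a[k]]
    for i in range(len(a)):
        if i != k:
            b = a[i][k]
            out[i] = [a[i][j] - b * out[k][j] for j in range(len(a))]
            out[i][k] = -b / d
    out[k][k] = 1.0 / d
    return out

def sscp(columns):
    """Corrected sums-of-squares-and-cross-products matrix of the columns."""
    n = len(columns[0])
    dev = [[v - sum(c) / n for v in c] for c in columns]
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in dev] for ci in dev]

def sse_of(a, subset):
    """SSE of the regression of the last column on the predictors in subset."""
    for k in subset:
        a = sweep(a, k)
    return a[-1][-1]

# Illustrative data (made up): three predictors and a response.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
y = [3.1, 3.9, 7.2, 8.1, 11.0, 11.8, 14.7, 16.2]
A = sscp([x1, x2, x3, y])

full = (0, 1, 2)
sse_full = sse_of(A, full)
# SSE of every proper submodel of the three-variable model.
sub_sse = {s: sse_of(A, s) for r in (1, 2) for s in combinations(full, r)}
```

If the SSE of the full model already exceeds the best value found so far for some subset size, none of the entries in `sub_sse` can improve on it, which is what allows a whole branch to be skipped.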

The technique proposed in the previous chapter is not
simply related to the sum of squared errors, since it
involves the coordinates of x. Thus, for a given model size
k, the model with smallest sum of squared errors may not
yield the smallest W. This implies that the bounds utilized
in Furnival's algorithm cannot be used for a similar search
for models with small W. Somewhat less sharp bounds can
nevertheless be obtained so that, with minor modifications,
Furnival's approach can be adapted to the case at hand.

A univariate one-sample $t^2$ statistic is defined by
$t^2 = n(x-\bar{X})^2/s^2$, where $\bar{X}$ is the mean of a sample of size n
on a variable X, x is an independent observation on the same
variable and $s^2$ is the sample variance of X. This is the
exact univariate analog of Hotelling's one-sample $T^2$.
Actually, $T^2 = n(x-\bar{X})S^{-1}(x-\bar{X})'$ is the square of the univariate
t-ratio of that linear combination of the variables which
inflates the t-ratio the most. (For a clear explanation of
the derivation of $T^2$ see (17).) A univariate t-ratio
corresponds to one such linear combination and therefore $T^2$
must be greater than or equal to the largest squared
univariate t-ratio. All p univariate t-ratios can be computed
and saved. Thus, a lower bound on the $T^2$ associated with any
set of X's is obtained. Since the Mahalanobis distance
$M = T^2/n$, a lower bound is also obtained on M. Recalling


(3.11),

$$W = F_{1-\alpha;1,n-k-1}\,\frac{SSE}{n-k-1}\left[\frac{n+1}{n} + \frac{M}{n-1}\right],$$

if $W_i$ denotes the smallest W currently available for models
of size i at any stage of the search, then the submodels
derived from a model of size k need not be examined if the
sum of squared errors of that model is greater than

$$\frac{n(n-1)(n-i-1)\,W_i}{F_{1-\alpha;1,n-i-1}\left[(n+1)(n-1) + t_{(i)}^2\right]}$$

for all i = 1,...,k-1, where $t_{(i)}^2$ is the i-th largest
univariate $t^2$. Notice that this quantity needs to be
calculated only when a model which improves $W_i$ is encountered.
For sharper bounds, this quantity can be recomputed at
every stage using for $t_{(i)}^2$ the i-th largest univariate $t^2$
among the variables in the model under consideration.
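
The inequality behind these bounds, namely that $T^2$ is at least as large as the largest squared univariate t-ratio, can be illustrated with two variables, for which the inverse of the covariance matrix has a closed form. The sketch below is plain Python on made-up data; the names `mean`, `cov`, `M_lower` and the point `x` are assumptions introduced only for the illustration.

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Illustrative sample (made up) on two variables and a point x to predict at.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x = (4.5, 2.0)
n = len(x1)

d1, d2 = x[0] - mean(x1), x[1] - mean(x2)
s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)

# Squared univariate t-ratios, t^2 = n (x - Xbar)^2 / s^2.
t2 = [n * d1 ** 2 / s11, n * d2 ** 2 / s22]

# Hotelling T^2 = n d S^{-1} d', using the closed-form inverse of a 2x2 matrix.
det = s11 * s22 - s12 ** 2
T2 = n * (s22 * d1 ** 2 - 2.0 * s12 * d1 * d2 + s11 * d2 ** 2) / det

M = T2 / n               # Mahalanobis distance of x from the sample
M_lower = max(t2) / n    # lower bound on M obtained from the univariate ratios
```

Because the univariate ratios are computed once and saved, `M_lower` costs nothing extra during the search.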

Empirical experience with this algorithm suggests that
it is approximately 5-10% less efficient when applied to W
than when it is applied to other criteria which are simple
functions of the sum of squared errors. As noted by
Furnival, time requirements are heavily data dependent.
This observation obviously applies to the W criterion as
well, in which case efficiency will also depend on the value
of x. The importance of this last dependency diminishes
for large n. In general, the efficiency of this algorithm
when applied to the W criterion should be such that problems
of comparable size can be handled without much additional
investment in computing time.

A  Stepwise  Algorithm 

It is often the case that a suboptimal stepwise search
is used in lieu of computing all regressions, especially in
an early screening of a very large number of variables.
Various stepwise procedures have been widely used for this
purpose, all of which are variations of FORWARD selection
and BACKWARD elimination (see e.g. (10)). These techniques,
based on the SWEEP operator (29), were designed for criteria
which are simple functions of the residual sum of squares
and, hence, must be modified to deal with the W criterion.
The computational complications associated with the W
criterion can be overcome by exploiting the monotonicity
property of Theorem 3.1. The SWEEP operator must be briefly
considered first, in order to locate the quantities needed
for a FORWARD selection and a BACKWARD elimination algorithm
based on the minimum W criterion.

Given an originally symmetric positive definite matrix
A, the SWEEP operator applied to the k-th diagonal element
of A is defined as follows:

Step 1: Let $D = a_{kk}$.

Step 2: Divide row k by D.

Step 3: For every other row $i \neq k$, let $B = a_{ik}$.
Subtract B × row k from row i. Set $a_{ik} = -B/D$.

Step 4: Set $a_{kk} = 1/D$.

If a SWEEP is performed on diagonal element i, variable $X_i$
is added to the current regression equation unless $X_i$ is
already in the regression, in which case $X_i$ is removed.
Observe that the result of a SWEEP is an absolutely
symmetric matrix, i.e., a matrix such that $a_{ij} = a_{ji}$ if an
equal number of SWEEPs have been performed on elements i
and j, and $a_{ij} = -a_{ji}$ otherwise. Thus, only the upper
triangular part of A need be computed.
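
The four steps above can be sketched directly in code. This is a minimal illustration in plain Python, assuming the matrix is stored as a list of rows and that the full matrix (not only its upper triangle) is kept; the function name `sweep` and the example matrix are hypothetical.

```python
def sweep(a, k):
    """Return a copy of the square matrix a after a SWEEP on diagonal element k."""
    a = [row[:] for row in a]          # work on a copy
    d = a[k][k]                        # Step 1: D = a_kk
    a[k] = [v / d for v in a[k]]       # Step 2: divide row k by D
    for i in range(len(a)):            # Step 3: adjust every other row
        if i != k:
            b = a[i][k]
            a[i] = [a[i][j] - b * a[k][j] for j in range(len(a))]
            a[i][k] = -b / d
    a[k][k] = 1.0 / d                  # Step 4: a_kk = 1/D
    return a

# A small symmetric positive definite matrix (made up for the illustration).
A = [[4.0, 2.0, 6.0],
     [2.0, 5.0, 3.0],
     [6.0, 3.0, 14.0]]

S = sweep(A, 0)   # sweep the first variable in
R = sweep(S, 0)   # sweep it out again
```

Here `S[0][1]` and `S[1][0]` differ only in sign, since element 0 has been swept once and element 1 not at all, while sweeping a second time on the same element restores the original matrix.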

For the purposes of the W criterion, A must be set
initially to the corrected sums-of-squares-cross-products
matrix of the data, i.e.,

$$A = \begin{bmatrix} X'X & X'Y \\ Y'X & Y'Y \end{bmatrix},$$

where X in this section will denote the original matrix X
corrected for the means $\bar{X}_1,\ldots,\bar{X}_p$ and Y will denote the
original vector Y corrected for the mean $\bar{Y}$. In more familiar
statistical terms, the matrix A is the covariance matrix of
the data multiplied by (n-1). Variables are entered and
deleted by sweeping on the corresponding diagonal elements
of A. After each SWEEP, statistics on the regression of Y
on the variables which have been swept in and submodel
information are available. To illustrate, suppose


$$A = \begin{bmatrix} X_1'X_1 & X_1'X_2 & X_1'Y \\ X_2'X_1 & X_2'X_2 & X_2'Y \\ Y'X_1 & Y'X_2 & Y'Y \end{bmatrix},$$

where $X_1$ contains some of the columns of the data matrix X.
Sweeping A on the diagonal elements of $X_1'X_1$ yields:

$$\begin{bmatrix} (X_1'X_1)^{-1} & (X_1'X_1)^{-1}X_1'X_2 & (X_1'X_1)^{-1}X_1'Y \\ -X_2'X_1(X_1'X_1)^{-1} & X_2'M_1X_2 & X_2'M_1Y \\ -Y'X_1(X_1'X_1)^{-1} & Y'M_1X_2 & Y'M_1Y \end{bmatrix},$$

where $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$.

The rightmost column contains the regression coefficients
and the sum of squared errors of the regression of Y
on the variables contained in $X_1$. The part $(X_1'X_1)^{-1}X_1'X_2$
gives the coefficients of the regressions of each variable
in $X_2$ on the variables in $X_1$ and the diagonal elements of
$X_2'M_1X_2$ give the sums of squared errors of the same
regressions.
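
As an illustration of where these quantities sit after a SWEEP, the sketch below sweeps one predictor into a three-column problem and reads off the four entries just described. It is plain Python on made-up data; `sweep` and `centered_sscp` are hypothetical helper names, and with a single swept variable the results can be compared with the usual closed forms for a straight-line fit.

```python
def sweep(a, k):
    """Return a copy of the square matrix a swept on diagonal element k."""
    d = a[k][k]
    out = [row[:] for row in a]
    out[k] = [v / d for v in a[k]]
    for i in range(len(a)):
        if i != k:
            b = a[i][k]
            out[i] = [a[i][j] - b * out[k][j] for j in range(len(a))]
            out[i][k] = -b / d
    out[k][k] = 1.0 / d
    return out

def centered_sscp(columns):
    """Corrected sums-of-squares-and-cross-products matrix of the columns."""
    n = len(columns[0])
    dev = [[v - sum(c) / n for v in c] for c in columns]
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in dev] for ci in dev]

# Made-up data: x1 plays the role of X_1, x2 the role of X_2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.0]

A = centered_sscp([x1, x2, y])
B = sweep(A, 0)               # x1 swept in, x2 still out

slope_y_on_x1 = B[0][2]       # (X1'X1)^{-1} X1'Y
sse_y_on_x1 = B[2][2]         # Y'M1Y
slope_x2_on_x1 = B[0][1]      # (X1'X1)^{-1} X1'X2
sse_x2_on_x1 = B[1][1]        # X2'M1X2
```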

For a FORWARD selection algorithm, the reduction in SSE
and the increase in the Mahalanobis distance resulting from
the inclusion of a variable among those in $X_2$ can easily be
computed. To illustrate, suppose that variables $X_1,\ldots,X_k$
have already been swept into the regression, with $B_{ij}$
denoting the elements of the swept matrix. The quantities
needed for the computation of the new W, say $W_j$, resulting
from the inclusion of variable $X_j$, j = k+1,...,p, are:

$$\tilde{x}_j = x_j - \sum_{i=1}^{k} B_{ij}x_i + \sum_{i=1}^{k} B_{ij}\bar{X}_i, \tag{4.1}$$

$$s_{jj}(1-r^2) = B_{jj}/(n-1) \tag{4.2}$$

and

$$SSE_j = B_{(p+1)(p+1)} - B_{j(p+1)}^2/B_{jj}, \tag{4.3}$$

where $s_{jj}(1-r^2)$ denotes, as in Chapter III, the conditional
variance of $X_j$ on $X_1,\ldots,X_k$. Therefore, using the
monotonicity property of Theorem 3.1, the new W resulting from
the introduction of variable $X_j$ into the regression equation
is

$$W_j = F_{1-\alpha;1,n-k-2}\,\frac{SSE_j}{n-k-2}\left[\frac{n+1}{n} + \frac{M}{n-1} + \frac{\tilde{x}_j^2}{B_{jj}}\right], \tag{4.4}$$

where M is the Mahalanobis distance associated with
variables $X_1,\ldots,X_k$. Thus, variable $X_j$ is the next variable
to sweep into the regression if $W_j = \min\{W_i,\ i = k+1,\ldots,p\}$.
For a BACKWARD elimination algorithm, the variable to
be swept out is determined as follows. If variable $X_j$,
j = 1,...,k, is deleted, the error sum of squares becomes

$$SSE_j = B_{(p+1)(p+1)} + B_{j(p+1)}^2/B_{jj}. \tag{4.5}$$

For the new Mahalanobis distance, the coefficients of the
regression of $X_j$ on $X_i$, i = 1,...,k, $i \neq j$, need to be
computed. These are given by

$$C_i = -B_{ij}/B_{jj}, \quad i \neq j.$$

Thus,

$$\tilde{x}_j = x_j - \sum_{i=1,\,i\neq j}^{k} C_i x_i + \sum_{i=1,\,i\neq j}^{k} C_i\bar{X}_i. \tag{4.6}$$

The new W, say $W_j$, is then obtained from (3.11) using $SSE_j$
and the correspondingly reduced Mahalanobis distance, M being
as noted above. Thus, variable $X_j$ is the next
variable to be swept out if $W_j = \min\{W_i,\ i = 1,\ldots,k\}$.

Given the monotonicity result of Chapter III, in FORWARD
selection the computed change in M is added to the current
Mahalanobis distance, while in BACKWARD elimination it is
subtracted. The current Mahalanobis distance must be saved
at each stage.
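
A single FORWARD step along these lines might be sketched as follows. This is a hedged illustration in plain Python, not the program used in this study: the data are made up, `forward_step`, `centered_sscp` and `sweep` are hypothetical names, the point of prediction is expressed as deviations `d` from the predictor means (so the mean terms of (4.1) are already absorbed), and the F quantile is left out because it is the same for every candidate at a given step and so does not affect the choice.

```python
def sweep(a, k):
    """Return a copy of the square matrix a swept on diagonal element k."""
    d = a[k][k]
    out = [row[:] for row in a]
    out[k] = [v / d for v in a[k]]
    for i in range(len(a)):
        if i != k:
            b = a[i][k]
            out[i] = [a[i][j] - b * out[k][j] for j in range(len(a))]
            out[i][k] = -b / d
    out[k][k] = 1.0 / d
    return out

def centered_sscp(columns):
    """Corrected sums-of-squares-and-cross-products matrix of the columns."""
    n = len(columns[0])
    dev = [[v - sum(c) / n for v in c] for c in columns]
    return [[sum(u * v for u, v in zip(ci, cj)) for cj in dev] for ci in dev]

def forward_step(B, swept, d, M, n, y_col):
    """Score each unswept predictor j by W_j / F, cf. (4.1)-(4.4)."""
    k = len(swept)
    scores = {}
    for j in range(y_col):
        if j in swept:
            continue
        x_tilde = d[j] - sum(B[i][j] * d[i] for i in swept)    # cf. (4.1)
        sse_j = B[y_col][y_col] - B[j][y_col] ** 2 / B[j][j]   # cf. (4.3)
        m_new = M + (n - 1) * x_tilde ** 2 / B[j][j]           # updated distance
        w_over_f = sse_j / (n - k - 2) * ((n + 1) / n + m_new / (n - 1))
        scores[j] = (w_over_f, m_new, sse_j)
    return scores

# Made-up data: three predictors, a response and a point of prediction.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
y = [3.1, 3.9, 7.2, 8.1, 11.0, 11.8, 14.7, 16.2]
cols = [x1, x2, x3]
x_new = [4.0, 6.0, 3.0]
n = len(y)

d = [x_new[j] - sum(cols[j]) / n for j in range(3)]   # deviations from the means
A = centered_sscp(cols + [y])
B = sweep(A, 0)                                       # x1 is already in the model
swept = [0]
M = (n - 1) * d[0] ** 2 / A[0][0]                     # distance for {x1} alone

scores = forward_step(B, swept, d, M, n, y_col=3)
best = min(scores, key=lambda j: scores[j][0])        # next variable to sweep in
```

Only the current swept matrix and the saved distance `M` are needed; no candidate model is actually fitted before the choice is made.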

The  computing  time  requirements  for  such  algorithms 
pose  no  limitations  on  their  applicability  in  problems  of 
sizes  normally  encountered  in  practice.  At  issue  is  the 
degree  to  which  the  models  selected  for  each  subset  size 
differ  from  the  ones  found  optimal  by  the  criterion  employed 
when  a  full  search  is  done.  In  order  to  gain  some  insight 
into  this,  the  two  stepwise  procedures  were  applied  to  the 
data  sets  used  in  Chapter  III.  The  models  selected  by  the 
FORWARD  selection  and  the  ones  selected  by  the  BACKWARD 
elimination  are  shown  in  Figures  3-11.  In  the  four  variable 
example,  the  algorithms  identified  the  better  models  for 
each  subset  size.  The  BACKWARD  procedure  identified  the 
best  one-variable  model,  while  the  FORWARD  procedure  located 
the  best  three  variables.  They  both  missed  the  overall 
optimal  model  (Xj,X4),  however,  the  two-variable  model 
selected  by  BACKWARD  elimination  had  a  W  very  close  to  the 
optimal one.

In the second example, depicted in Figures 6-11, the
two procedures selected the same model for all model sizes.
Observe that, in all cases, the models selected by these
sub-optimal procedures coincided with the minimum-W-optimal
ones. This, of course, is the most desirable situation.
The degree to which it will happen in practice depends on
the particular set of data. However, if these two examples
offer any indication, it seems that the stepwise procedures
can fruitfully be employed either in thinning down a large
number of variables to a subset on which a full search by
means of the branch-and-bound algorithm will be economically
feasible, or by themselves. The better models of each
subset size should be identified at least in the cases of well
behaved data sets.


CHAPTER  V 


ON  THE  INFLUENCE  OF  OBSERVATIONS  ON  W 

Theory  and  Discussion 

The influence of individual observations on the various
quantities of interest in a statistical analysis of data has
received considerable attention in the recent literature
(8), (13), (18), etc. It is argued that observations which
significantly affect (have high leverage on) such quantities
ought to be given careful scrutiny. The object is to detect
"outlying" points and to investigate them further, examining
whether the analysis can be enhanced by setting them aside.
Possible errors of transcription, for instance, might be
discovered. More realistically, in cases of designed
experiments, such knowledge may prove useful in suggesting
ways in which the design may be improved. Taking more
measurements in the space of the explanatory variables could
improve the analysis.

Identification of outliers does not necessarily imply,
or argue for, the rejection of such points. It is only
meant as a tool for the analysis of data and should be used
with caution. Nevertheless, the inclusion of faulty data
can adversely affect the analysis to a substantial degree.
This point has recently received attention in the
literature. Hoaglin and Welsch (18) studied the "hat" matrix
$X(X'X)^{-1}X'$. They suggested an approach combining the
information carried by the hat matrix and the studentized
residuals in an effort to discover exceptional and/or
discrepant points.

Cook (7) proposed the distance

$$D_i = [(b-b_{(i)})'X'X(b-b_{(i)})]/(k \times MSE),$$

where $b_{(i)}$ denotes the estimated coefficients obtained
without observation i, as a measure of the influence of the
i-th data point. He, too, related such influences to the
hat matrix, the studentized residuals and residual
variances.

Welsch  and  Peters  (36)  suggested  methods  for  examining 
more  than  two  observations  at  a  time  and  placed  emphasis 
on  the  computational  aspects  of  these  diagnostic  measures. 

Gentleman  and  Wilk  (13)  developed  analysis  of  variance 
methods  to  identify  outlying  subsets  of  K  observations. 

The investigations above are mainly concerned with the
influence of outliers on the parameter estimates rather than
prediction. In the context of this investigation, an
observation (or group of observations) may be termed exceptional
if W changes significantly when that observation (or group
of observations) is set aside and the least squares
calculations are performed on the reduced data set. The effected
change in W will be investigated by means of the ratio of
prediction variances. Some new notation will facilitate the
exposition of this chapter.

Partition the matrix X as $\{X_1', X_2'\}'$, where $X_1$ is $n_1 \times p$
and $X_2$ is $n_2 \times p$ with $n_1 + n_2 = n$. The vector Y is partitioned
in like manner into components $Y_1$ and $Y_2$. Without loss of
generality, assume that the observations in $(X_2, Y_2)$ are set
aside. Let $s^2$ and $s_1^2$ denote the mean square errors of the
full model and the submodel, respectively. A superscript
(1) will indicate that the quantity to which it is attached
has been computed from the regression using $X_1$ and $Y_1$ only.
Let $e_1$ and $e_2$ be the $n_1 \times 1$ and $n_2 \times 1$ vectors of residuals
corresponding to $Y_1$ and $Y_2$ respectively when the full data base
is used. In accord with the convention stated above,
$e_1^{(1)}$ and $e_2^{(1)}$ are then the vectors of residuals
corresponding to $Y_1$ and $Y_2$ resulting from fitting the model to $X_1$ and
$Y_1$. A well known result (see, for example, Bingham (4))
yields


$$s_1^2 = \frac{(n-p)s^2 - e_2'[I - X_2(X'X)^{-1}X_2']^{-1}e_2}{n_1-p}, \tag{5.1}$$

where I denotes the $n_2 \times n_2$ identity matrix. Letting

$$\gamma^2 = \frac{s^2[1 + x(X'X)^{-1}x']}{s_1^2[1 + x(X_1'X_1)^{-1}x']}, \tag{5.2}$$

i.e., the ratio of prediction variances at x, and
substituting (5.1) into (5.2), it is easy to show that

$$\gamma^2 = \frac{n_1-p}{n-p}\,Q\,H, \tag{5.3}$$

where

$$Q = 1 + \frac{e_2'[I - X_2(X'X)^{-1}X_2']^{-1}e_2}{(n_1-p)\,s_1^2}$$

and

$$H = \frac{1 + x(X'X)^{-1}x'}{1 + x(X_1'X_1)^{-1}x'}.$$

Using another identity from (4), namely

$$e_2 = [I + X_2(X_1'X_1)^{-1}X_2']^{-1}e_2^{(1)} = [I - X_2(X'X)^{-1}X_2']\,e_2^{(1)}, \tag{5.4}$$

$\gamma^2$ can also be written as

$$\gamma^2 = \frac{n_1-p}{n-p}\,Q^{(1)}\,H, \tag{5.5}$$

where

$$Q^{(1)} = 1 + \frac{e_2^{(1)\prime}[I + X_2(X_1'X_1)^{-1}X_2']^{-1}e_2^{(1)}}{(n_1-p)\,s_1^2}.$$

Relations (5.3) and (5.5) express the ratio of prediction
variances, $\gamma^2$, in terms of quantities which yield to
intuitive interpretations. These quantities are studied next, in
an effort to isolate the characteristics of observations
whose deletion results in a significant change in W.
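
For a single deleted observation ($n_2 = 1$) in a straight-line model with an intercept ($p = 2$), all the matrices in (5.1) through (5.5) reduce to scalars, so the relations can be checked with a few lines of arithmetic. The sketch below is plain Python on made-up data; `fit`, `quad` and the point `x0` are assumptions introduced for the illustration, and the last observation plays the role of $(X_2, Y_2)$.

```python
def fit(xs, ys):
    """Least squares line; returns intercept, slope, xbar and Sxx."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope, xbar, sxx

def quad(x0, n, xbar, sxx):
    """x (X'X)^{-1} x' for the row x = (1, x0) of a straight-line design."""
    return 1.0 / n + (x0 - xbar) ** 2 / sxx

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9, 9.5]
x0 = 4.5                 # point at which the response is predicted
p, n = 2, len(xs)
n1 = n - 1               # the last observation is set aside

a, b, xbar, sxx = fit(xs, ys)
a1, b1, xbar1, sxx1 = fit(xs[:-1], ys[:-1])

s2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - p)
s2_1 = sum((y - a1 - b1 * x) ** 2 for x, y in zip(xs[:-1], ys[:-1])) / (n1 - p)

e2 = ys[-1] - a - b * xs[-1]            # residual from the full fit
e2_1 = ys[-1] - a1 - b1 * xs[-1]        # residual from the fit without it
h22 = quad(xs[-1], n, xbar, sxx)        # X_2 (X'X)^{-1} X_2'
g22 = quad(xs[-1], n1, xbar1, sxx1)     # X_2 (X_1'X_1)^{-1} X_2'

s2_1_from_51 = ((n - p) * s2 - e2 ** 2 / (1.0 - h22)) / (n1 - p)     # (5.1)
H = (1 + quad(x0, n, xbar, sxx)) / (1 + quad(x0, n1, xbar1, sxx1))
gamma2 = s2 / s2_1 * H                                               # (5.2)
Q = 1 + e2 ** 2 / (1.0 - h22) / ((n1 - p) * s2_1)                    # as in (5.3)
Q1 = 1 + e2_1 ** 2 / (1.0 + g22) / ((n1 - p) * s2_1)                 # as in (5.5)
```

Both `(n1 - p) / (n - p) * Q * H` and `(n1 - p) / (n - p) * Q1 * H` reproduce `gamma2`, the first using only full-model quantities and the second only reduced-model ones.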

It is clear that a reduction in W is obtained if, and
only if,

$$\gamma^2 > F_{1-\alpha;1,n_1-p}/F_{1-\alpha;1,n-p}.$$

Consider the three factors comprising $\gamma^2$. Clearly,
$\{(n_1-p)/(n-p)\} < 1$. The following theorem shows that the
last factor also has this property.

Theorem 5.1:

$$\frac{1 + x(X'X)^{-1}x'}{1 + x(X_1'X_1)^{-1}x'} \le 1.$$

Proof: It suffices to prove the result for the case
$n_2 = 1$. Since $X'X = X_1'X_1 + X_2'X_2$ and

$$(X'X)^{-1} = (X_1'X_1)^{-1} - \frac{(X_1'X_1)^{-1}X_2'X_2(X_1'X_1)^{-1}}{1 + X_2(X_1'X_1)^{-1}X_2'},$$

it follows that

$$x(X_1'X_1)^{-1}x' = x(X'X)^{-1}x' + \frac{[x(X_1'X_1)^{-1}X_2']^2}{1 + X_2(X_1'X_1)^{-1}X_2'}. \tag{5.6}$$

Since $(X_1'X_1)^{-1}$ is positive definite, the second term in the
right hand side of (5.6) is non-negative, completing the
proof.
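
Identity (5.6) and the resulting inequality can be checked numerically in the simplest setting, a straight-line design with an intercept and one deleted row. The sketch below is plain Python; the design points are made up, and `moments` and `bilinear` are hypothetical helpers that use the closed form $(1,u)(X'X)^{-1}(1,v)' = 1/n + (u-\bar{x})(v-\bar{x})/S_{xx}$ valid for this design.

```python
def moments(xs):
    """Number of rows, mean and corrected sum of squares of the predictor."""
    n = len(xs)
    xbar = sum(xs) / n
    return n, xbar, sum((x - xbar) ** 2 for x in xs)

def bilinear(u, v, n, xbar, sxx):
    """(1, u) (X'X)^{-1} (1, v)' for a design with an intercept and one predictor."""
    return 1.0 / n + (u - xbar) * (v - xbar) / sxx

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
full = moments(xs)           # all n rows
reduced = moments(xs[:-1])   # rows of X_1 only
x2 = xs[-1]                  # the deleted row is (1, x2)

checks = []
for x0 in (-3.0, 0.0, 2.5, 4.5, 7.0, 20.0):
    lhs = bilinear(x0, x0, *reduced)                           # x (X1'X1)^{-1} x'
    rhs = (bilinear(x0, x0, *full)
           + bilinear(x0, x2, *reduced) ** 2
           / (1.0 + bilinear(x2, x2, *reduced)))               # right side of (5.6)
    ratio = (1.0 + bilinear(x0, x0, *full)) / (1.0 + lhs)      # factor of Theorem 5.1
    checks.append((lhs, rhs, ratio))
```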

The second factor in (5.5) is greater than one. This
is so because the matrix $I + X_2(X_1'X_1)^{-1}X_2'$ is positive
definite, being the sum of a positive definite and a
non-negative definite matrix and,
therefore, $[I + X_2(X_1'X_1)^{-1}X_2']^{-1}$ is also positive definite.
Since only the last two factors in $\gamma^2$ depend on the
components of $X_2$, they will expose the characteristics of
observations which affect W. The factor

$$t_2^2 = \frac{e_2^{(1)\prime}[I + X_2(X_1'X_1)^{-1}X_2']^{-1}e_2^{(1)}}{s_1^2}$$

is studied first.

Observe that $|I + X_2(X_1'X_1)^{-1}X_2'|$ is a measure of the
collective distance of the points in $X_2$ from the rest of the
data. This is meant in the following sense: Clearly, if
$n_2 = 1$, then

$$1 + X_2(X_1'X_1)^{-1}X_2' = \frac{n+1}{n} + \frac{M}{n-1},$$

where M is the Mahalanobis distance of the point $X_2$ from the
data base. When $n_2 > 1$, a large $|I + X_2(X_1'X_1)^{-1}X_2'|$ will
indicate that the points in $X_2$ are either far from the
centroid of $X_1$, or that their covariance structure is
different from that of $X_1$, or both. Thus, for a reduction in
W, the determinant of $I + X_2(X_1'X_1)^{-1}X_2'$ must be small. That
is to say, other things being equal in (5.5), points near
the centroid of $X_1$ are more likely to cause W to decrease
when they are omitted. This should be intuitively appealing.
For the variances of the estimated coefficients to be small,
the data points must be widely dispersed in X-space.

Next, observe that the residuals $e_2^{(1)}$ must be large in
absolute value. In other words, the Y values of the
observations to be set aside must be discrepant in the sense
that, when the model is built on the remaining $n_1$
observations, $(X_2, Y_2)$ are not fit well. Cook (8), Hoaglin and
Welsch (18) and others have linked the residual $t_2^2$ with the
influence of the set $(X_2, Y_2)$ on the coefficient estimates.
Also, as should be obvious, $s_1^2$ should be small. For
diagnostic purposes, it will be more convenient to look at the
factors comprising $t_2^2$ simultaneously. Notice that
$s_1^2[I + X_2(X_1'X_1)^{-1}X_2']$ is the usual estimate of the covariance
matrix of the residuals $e_2^{(1)}$ (which, incidentally, are the
basis for Allen's PRESS criterion). Therefore, $t_2^2$ can be
viewed as a collective studentized residual corresponding to
the omitted set $(X_2, Y_2)$. In fact, when $n_2 = 1$, $t_2^2$ reduces to

$$\frac{(e_2^{(1)})^2}{s_1^2[1 + X_2(X_1'X_1)^{-1}X_2']},$$

which is exactly the square of what is called the studentized
residual. When the rows in $X_2$ have been specified in advance, the
quantity

$$t_2^2/n_2 = \frac{e_2^{(1)\prime}[I + X_2(X_1'X_1)^{-1}X_2']^{-1}e_2^{(1)}}{n_2\,s_1^2}$$

is distributed as F with $n_2$ and $n_1-p$ degrees of freedom,
since the numerator is distributed as $\sigma^2\chi^2(n_2)/n_2$,
independently of the denominator, which is distributed as
$\sigma^2\chi^2(n_1-p)/(n_1-p)$. Thus, observations whose collective
studentized residual is significantly large ought to be
investigated further. It should be noted that $t_2^2$ depends not
only on the individual residuals, but on their correlations
as well. Although in practice observations whose
studentized residuals are small rarely reduce W significantly when
they are combined with others, this is not always the case.
Cases have been observed where the pair which causes W to
decrease the most consists of observations which, if deleted
individually, would cause W to increase. To such a
"masking" effect the last factor in $\gamma^2$ may contribute
significantly. One way to look at

$$\frac{1 + x(X'X)^{-1}x'}{1 + x(X_1'X_1)^{-1}x'}$$

is as a measure of the relative distances from x to X and $X_1$
respectively. As noted above, this factor cannot be
greater than one. For a maximum reduction in W, it should
be as close to one as possible. This would imply that the
deletion of $X_2$ does not greatly increase the Mahalanobis
distance from x to the data base. This factor can also be
studied in terms of residual correlations. Notice that


$$\eta^2 = \frac{[1 + x(X_1'X_1)^{-1}x'] - [1 + x(X'X)^{-1}x']}{1 + x(X_1'X_1)^{-1}x'}. \tag{5.7}$$

Recall that the residuals $y - \hat{y}^{(1)}$ and $Y_2 - \hat{Y}_2^{(1)}$ have a
sampling distribution which, under the usual assumptions on $\varepsilon$,
is normal with mean vector 0 and covariance matrix

$$\sigma^2\begin{bmatrix} 1 + x(X_1'X_1)^{-1}x' & x(X_1'X_1)^{-1}X_2' \\ X_2(X_1'X_1)^{-1}x' & I + X_2(X_1'X_1)^{-1}X_2' \end{bmatrix}.$$

Simple algebra will show that $\eta^2$ is the square of the
multiple correlation between $y - \hat{y}^{(1)}$ and $Y_2 - \hat{Y}_2^{(1)}$. This
observation allows the following intuitively obvious statement: For a
reduction in W, the deleted observations $(X_2, Y_2)$ should not
contribute significantly in the explanation of the
variability in y, beyond that which is provided by the retained
rows.

It may be of interest to note the similarity between
row deletion and variable selection. In the latter, the
$\binom{p}{1} + \binom{p}{2} + \cdots + \binom{p}{p}$, p < n, submodels are investigated, in order
to find the minimal, in some sense, subset of variables
which adequately explains the data. In the former, the
model is kept fixed and the possible $\binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n_2}$,
$n_2 \ll (n-p)$, data subsets are explored in order to find the
maximal set which is adequately explained by the model. The
postulated model form is held fixed, as the notion of
"outlier" is valid only relative to a prespecified model
form. The notation above indicates that $n_2$ must be small
relative to n-p, in order to have a sufficient number of
error degrees of freedom left. Notice that, as $n_2 \to n-p$,
the sum of the squares of the residuals approaches zero,
thus creating a false sense of security. In practice, if
observations were to be deleted one at a time as long as
some measure of fit, or W, "improved", most of the time all
error degrees of freedom would be exhausted. This can be
seen as follows: Observe first that the hat matrix
$H = X(X'X)^{-1}X'$ is a projection matrix, i.e., $H \times H = H$, and
as such, it has all its eigenvalues equal to zero or one
((16), Thm. 1.7.2, p. 39). The number of nonzero
eigenvalues is equal to the rank of H. In the full rank case,
rank(H) = rank(X) = p. Hence, trace(H) = p, i.e.,
$\sum_{i=1}^{n} h_{ii} = p$. Also, since the ratio of the mean square error
of the full model to that of the reduced one is equal to
$[(n-p-1)/(n-p)][1 + t_2^2/(n-p-1)]$, it follows that the mean
square error will decrease if and only if an observation
with $t_2^2 > 1$ is deleted. Finally, let

$$t^2 = \frac{e_2^2}{s^2[1 - X_2(X'X)^{-1}X_2']}.$$

Then $t_2^2 > 1 \iff t^2 > 1$. These observations and the lemma
which follows will help in proving Theorem 5.2 below.


Lemma 5.1. Let $z_1, z_2, \ldots, z_n$ be a set of non-negative
numbers. Let $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$, and let $a_1, a_2, \ldots, a_n$ be another
set of non-negative numbers such that $\sum_{i=1}^{n} a_i = n$. Then
there exists i such that $z_i/(a_i\bar{z}) \ge 1$.

Proof: Suppose that $z_i/(a_i\bar{z}) < 1$ for all i. Then,

$$z_i < a_i\bar{z} \text{ for all } i \;\Rightarrow\; \sum_{i=1}^{n} z_i < \sum_{i=1}^{n} a_i\bar{z} \;\Rightarrow\; \sum_{i=1}^{n} z_i < \bar{z}\sum_{i=1}^{n} a_i \;\Rightarrow\; \sum_{i=1}^{n} z_i < n\bar{z} \;\Rightarrow\; \sum_{i=1}^{n} z_i < \sum_{i=1}^{n} z_i,$$

which is obviously false.

Note: Either there exists an i such that $z_i/(a_i\bar{z}) > 1$,
or $z_i/(a_i\bar{z}) = 1$ for all i.


Theorem 5.2. With probability one, there exists at
least one observation which, if deleted, will cause the
mean square error to decrease.

Proof: Equivalently, using the observations made above,
there exists at least one i for which

$$t^2 = \frac{e_i^2}{s^2(1-h_{ii})} > 1.$$

Write

$$\frac{e_i^2}{s^2(1-h_{ii})} = \frac{e_i^2}{(1-h_{ii})\,\dfrac{1}{n-p}\sum_{i=1}^{n} e_i^2} = \frac{e_i^2}{\dfrac{n(1-h_{ii})}{n-p}\cdot\dfrac{1}{n}\sum_{i=1}^{n} e_i^2}.$$

Notice that

$$\sum_{i=1}^{n}\frac{n(1-h_{ii})}{n-p} = \frac{n}{n-p}\sum_{i=1}^{n}(1-h_{ii}) = \frac{n}{n-p}(n-p) = n.$$

Now, since $0 \le h_{ii} \le 1$, each $n(1-h_{ii})/(n-p)$ is non-negative,
and the claim follows from Lemma 5.1 by letting

$$z_i = e_i^2, \qquad \bar{z} = \frac{1}{n}\sum_{i=1}^{n} e_i^2 \qquad \text{and} \qquad a_i = \frac{n(1-h_{ii})}{n-p}.$$
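
The ingredients of this proof can be illustrated on a small straight-line example: the leverages sum to p, at least one observation has $e_i^2/[s^2(1-h_{ii})]$ of at least one, and deleting the observation with the largest such value lowers the mean square error. The sketch below is plain Python on made-up data; `line_fit` and `mse` are hypothetical helper names.

```python
def line_fit(xs, ys):
    """Least-squares intercept and slope of y on x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope

def mse(xs, ys, p=2):
    """Mean square error of the straight-line fit (p = 2 parameters)."""
    a, b = line_fit(xs, ys)
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (len(xs) - p)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9, 9.5]
n, p = len(xs), 2

xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
h = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]      # diagonal of the hat matrix

a, b = line_fit(xs, ys)
e = [y - a - b * x for x, y in zip(xs, ys)]
s2 = sum(v * v for v in e) / (n - p)

t2 = [ei ** 2 / (s2 * (1.0 - hi)) for ei, hi in zip(e, h)]
i_max = max(range(n), key=lambda i: t2[i])

# Refit without the observation that has the largest t^2.
mse_without = mse(xs[:i_max] + xs[i_max + 1:], ys[:i_max] + ys[i_max + 1:])
```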


With respect to W, this result implies the following:
Given that the factor $1-\eta^2$ is, for most observations, close
to one, especially when n is large relative to p, there will
tend to exist at least one observation whose deletion from
the data base will make $\gamma^2 > F_{1-\alpha;1,n_1-p}/F_{1-\alpha;1,n-p}$, thus
causing W to decrease. In practice, this seems frequently
to be the case. Therefore, a decrease in W should not be
the objective in determining observations to be set aside.
These results suggest that a maximal $n_2$ should be chosen in
advance, according to the analyst's a priori belief about
the maximum possible (or likely) number of outliers, and
such that $n_2 \ll (n-p)$. Then, subsets of $n_2$ or fewer
observations whose deletion greatly reduces W should be examined
in view of the discussion in the beginning of this chapter.
It should be reemphasized that the object of such analysis
is not the rejection of observations, but rather the
gaining of insight about the data under investigation.
Points whose presence in the data base has a significant
effect on the quantities of interest should be scrutinized.
If the validity of such observations is beyond question,
the reasons for such behavior should be investigated. This
should be done in an effort to gain a more penetrating
insight into the data under investigation, an insight which might
suggest otherwise overlooked remedial action such as the
need for collection of more data in certain regions of the
explanatory variables' space, when that is possible.


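Computation

For computational purposes it will be preferable to
express $\gamma^2$ in terms of the full model. For this purpose,
the matrix (only the upper triangle is shown)

$$
A = \begin{bmatrix}
X'X & X'Y & X' & x' \\
    & Y'Y & Y' & 0  \\
    &     & I  & 0  \\
    &     &    & -1
\end{bmatrix}
$$

can be set up.

After sweeping on the diagonal elements of X'X,

$$
B = \begin{bmatrix}
(X'X)^{-1} & (X'X)^{-1}X'Y & (X'X)^{-1}X' & (X'X)^{-1}x' \\
 & Y'Y - Y'X(X'X)^{-1}X'Y & e' & -\hat{y} \\
 & & I - X(X'X)^{-1}X' & -X(X'X)^{-1}x' \\
 & & & -1 - x(X'X)^{-1}x'
\end{bmatrix},
$$

where e' denotes the $1 \times n$ vector of residuals. If $n_2 = 1$, the
quantities needed for $\gamma^2$ are available in B. If $n_2 > 1$, the
matrix

$$
C = \begin{bmatrix}
I - X_2(X'X)^{-1}X_2' & -X_2(X'X)^{-1}x' & e_2 \\
 & -1 - x(X'X)^{-1}x' & 0 \\
 & & 0
\end{bmatrix}
$$

must be formed, and the SWEEP operator must be applied to
the diagonal elements of $I - X_2(X'X)^{-1}X_2'$.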
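The sweeping step above can be illustrated with a minimal sketch. The sign convention used in `sweep` is an assumption chosen so that the upper triangle of the swept tableau agrees with the matrix B shown above; the source does not state the convention for the lower triangle. The data, the point `x0`, and the reduced tableau (the X' and I blocks are omitted for brevity) are hypothetical.

```python
def sweep(a, k):
    """Return a copy of the square matrix `a` swept on diagonal element k."""
    n = len(a)
    d = a[k][k]
    if abs(d) < 1e-12:
        raise ZeroDivisionError("pivot too close to zero")
    # Non-pivot entries become a_ij - a_ik * a_kj / d.
    out = [[a[i][j] - a[i][k] * a[k][j] / d for j in range(n)] for i in range(n)]
    for j in range(n):
        out[k][j] = a[k][j] / d     # pivot row
        out[j][k] = -a[j][k] / d    # pivot column (sign is an assumed convention)
    out[k][k] = 1.0 / d
    return out


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical data: intercept plus one predictor, four observations.
cols = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]   # columns of X
y = [1.0, 3.0, 5.0, 8.0]
x0 = [1.0, 4.0]                                        # point to predict at
p = len(cols)

# Reduced tableau [[X'X, X'Y, x0'], [Y'X, Y'Y, 0], [x0, 0, -1]].
m = [[dot(ci, cj) for cj in cols] + [dot(ci, y), x0[i]]
     for i, ci in enumerate(cols)]
m.append([dot(c, y) for c in cols] + [dot(y, y), 0.0])
m.append(list(x0) + [0.0, -1.0])

for k in range(p):          # sweep on the diagonal elements of X'X
    m = sweep(m, k)

b = [m[i][p] for i in range(p)]   # (X'X)^{-1} X'Y
sse = m[p][p]                     # Y'Y - Y'X (X'X)^{-1} X'Y
y_hat = -m[p][p + 1]              # predicted response at x0
q = -m[p + 1][p + 1] - 1.0        # x0 (X'X)^{-1} x0'
```

One pass over the first p pivots yields the coefficients, the residual sum of squares, the prediction and the quadratic form at the prediction point, which is why a single tableau is convenient here.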

It may be of interest to note that the change in the
regression coefficients when rows $(X_2, Y_2)$ are deleted,

$$\Delta b = b - b^{(1)}, \qquad (5.12)$$

is obtained after some algebra using partitioned matrices. The
quantities needed for $\Delta b$ are produced in the matrices above.
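For reference, the difference can be written in closed form in the notation of this chapter. What follows is the standard row-deletion identity, offered as a sketch rather than as a transcription of (5.12); the symbol $e_2$, the full-model residuals on the deleted rows, is notation assumed here.

```latex
\Delta b = b - b^{(1)}
         = (X'X)^{-1} X_2' \left[ I - X_2 (X'X)^{-1} X_2' \right]^{-1} e_2 ,
\qquad e_2 = Y_2 - X_2 b .
```

Under this form each factor is available from the tableaux: $(X'X)^{-1}X_2'$ is a block of B, and the bracketed inverse results from sweeping C on the diagonal elements of $I - X_2(X'X)^{-1}X_2'$.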


Row Deletion and Variable Augmentation -
An Equivalence

Suppose that, instead of deleting the last $n_2$ rows from X,
an expanded X matrix is formed which will be denoted by
$X^*$. $X^*$ is formed by appending to X $n_2$ columns with zeros in
rows $1, 2, \ldots, n_1$, and an $n_2 \times n_2$ identity matrix for the last
$n_2$ rows, i.e.,

$$X^* = \begin{bmatrix} X_1 & 0 \\ X_2 & I \end{bmatrix}.$$

Let also $x^* = [x, 0]$, where 0 is a $1 \times n_2$ vector of zeros.
Now,

$$X^{*\prime}X^* = \begin{bmatrix} X'X & X_2' \\ X_2 & I \end{bmatrix}.$$

A fundamental identity on the form of the inverse of a
partitioned matrix (see for example (16), Theorem 8.2.5) yields

$$
(X^{*\prime}X^*)^{-1} = \begin{bmatrix}
(X_1'X_1)^{-1} & -(X_1'X_1)^{-1}X_2' \\
 & I + X_2(X_1'X_1)^{-1}X_2'
\end{bmatrix}.
$$

Using this form of $(X^{*\prime}X^*)^{-1}$,

$$x^*(X^{*\prime}X^*)^{-1}x^{*\prime} = x(X_1'X_1)^{-1}x'$$

obtains. So, obviously,

$$1 + x^*(X^{*\prime}X^*)^{-1}x^{*\prime} = 1 + x(X_1'X_1)^{-1}x'.$$

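Consider the relationship between $s_{(1)}^2$ and
$s^{*2} = e^{*\prime}e^*/(n_1-p)$,
i.e., between the mean square errors for the model with the
last $n_2$ rows deleted and the model with the new columns
appended to X, respectively. Clearly,

$$
e^* = Y - X^*(X^{*\prime}X^*)^{-1}X^{*\prime}Y
    = \begin{bmatrix} Y_1 - X_1(X_1'X_1)^{-1}X_1'Y_1 \\ 0 \end{bmatrix}
    = \begin{bmatrix} e_1^{(1)} \\ 0 \end{bmatrix}.
$$

So,

$$e^{*\prime}e^* = (e_1^{(1)\prime}, 0')\,(e_1^{(1)\prime}, 0')' = e_1^{(1)\prime}e_1^{(1)}.$$

Therefore,

$$s^{*2} = e^{*\prime}e^*/(n_1-p) = e_1^{(1)\prime}e_1^{(1)}/(n_1-p) = s_{(1)}^2.$$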

The above relations suggest an algorithm for detecting
outliers. If this approach is used, any determination about
which rows are outlying must be based on the estimated
coefficients corresponding to the dummy variables. A
significant coefficient indicates that the corresponding
observation is not adequately explained by the rest of the data
and that it needs its own parameter. This becomes clear if one
observes that

$$
b^* = (X^{*\prime}X^*)^{-1}X^{*\prime}Y
    = \begin{bmatrix} b^{(1)} \\ e_2^{(1)} \end{bmatrix}.
$$

Now the last $n_2$ observations are fully explained by means
of the coefficients $e_2^{(1)}$. (Compare also the expression for
$e^*$ found earlier.) The significance of these coefficients
is, therefore, identical to the significance of the
residuals $e_2^{(1)}$. The test statistic $t^2/n_2$ is simply the usual
partial F for testing whether a set of regression
coefficients is zero in the presence of other explanatory
variables. The F distribution can be used only if the set
$(X_2, Y_2)$ has been specified in advance, and not after the
data have been inspected and, say, the maximum has been
chosen to be tested.
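The equivalence between row deletion and dummy-variable augmentation can be checked numerically. The sketch below uses hypothetical data with $n_2 = 1$ and a small Gauss-Jordan solver written for this illustration (the function name `ols` is not from the source). It fits the model once with the last row deleted and once with a dummy column appended, then compares the two fits.

```python
def ols(cols, y):
    """Least squares via Gauss-Jordan elimination on the normal equations.

    `cols` holds the columns of the design matrix. Returns the list of
    coefficients and the list of residuals.
    """
    p = len(cols)
    # Augmented normal equations [X'X | X'Y].
    a = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols]
         + [sum(u * v for u, v in zip(ci, y))] for ci in cols]
    for k in range(p):
        piv = a[k][k]
        a[k] = [v / piv for v in a[k]]
        for i in range(p):
            if i != k:
                f = a[i][k]
                a[i] = [vi - f * vk for vi, vk in zip(a[i], a[k])]
    b = [row[p] for row in a]
    resid = [yi - sum(bj * c[i] for bj, c in zip(b, cols))
             for i, yi in enumerate(y)]
    return b, resid


# Hypothetical data; the last row plays the role of (X_2, Y_2), n_2 = 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.1, 4.9, 7.2, 9.0, 20.0]
n, n1 = len(y), len(y) - 1

# (a) Delete the last row and fit on the remaining n_1 rows.
b1, e1 = ols([[1.0] * n1, x[:n1]], y[:n1])
sse_deleted = sum(r * r for r in e1)

# (b) Keep all rows but append a dummy column for the last row.
dummy = [0.0] * n1 + [1.0]
b_star, e_star = ols([[1.0] * n, x, dummy], y)
sse_augmented = sum(r * r for r in e_star)

# Prediction residual of the deleted row under fit (a).
e2_pred = y[-1] - (b1[0] + b1[1] * x[-1])
```

In this sketch the first two entries of `b_star` reproduce `b1`, the dummy coefficient equals `e2_pred`, and the two residual sums of squares coincide. Both fits have $n_1 - p$ error degrees of freedom, so the mean square errors agree as well.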


CHAPTER VI
APPLICATION

In this chapter, an application of this technique in
the field of management science is discussed. The
performance of the W criterion is compared with that of other
commonly used selection criteria as well as with that of models
proposed in independent studies by other investigators. The
field of application, parametric cost estimation, is
receiving considerable attention (31), (33), (34). Parametric
cost estimation is a widely used method of obtaining
single-valued predictions of the cost of a new item, such as a
weapon system. It deals with predicting the cost (response
variable) of a system by means of explanatory variables
(predictors) such as system characteristics or performance
requirements. This procedure is based on the premise that
the cost of a system is related in a quantifiable way to
the system's physical and performance characteristics. The
expression of this quantifiable relationship is in the form
of an estimating equation derived through statistical
regression analysis of historical cost data on systems
which are, more or less, analogous to the proposed system.
Recent experience in weapon system acquisition programs
has underscored the differences between cost estimates and
realized costs. This has given impetus to the search for
better cost estimating techniques.


Consider the case of predicting the cost of a new
aircraft based on its planned physical and performance
characteristics, and the costs and characteristics of
aircraft built in the past. The role of analogy is obvious
in this situation. Which historical aircraft and which
variables should be used? The Mahalanobis distance seems
well suited to answer these questions. Any variable
selection technique which ignores the issue of analogy, and
which fails to give some consideration to dimensions
(variables) along which there is a marked dissimilarity between
the proposed system and the historical data, may lead to
gross errors due to extrapolation. The W criterion has the
potential of bringing this issue to the attention of the
analyst, and should be used as part of a thorough
investigation. In what follows, optimal models under different
criteria are "automatically" computed and used in an effort
to compare on a fair (or equally unfair) basis the relative
performance of the W criterion. It should be underscored
that this is done partly because of considerations of
mathematical convenience and it may not, in all instances,
agree with good practice.

The data base is given in Table I. It consists of 23
observations on 12 physical and performance characteristics
of different single engine jet fighter aircraft built over
an interval spanning the years from 1947 to 1969. The
response variable, Y, is the flyaway unit cost (in 1972
$100,000) of the hundredth aircraft built for each type.
The variables, denoted by $X_1, X_2, \ldots, X_{12}$, are the values of
the following characteristics:

$X_1$ = Wing Loading Ratio
$X_2$ = Aspect Ratio
$X_3$ = Full to Empty Weight Ratio
$X_4$ = Thickness-to-Chord Ratio
$X_5$ = Lift to Drag Ratio
$X_6$ = Total Avionics Input Power in kva
$X_7$ = Maximum Speed in knots (clean, combat weight)
$X_8$ = Weight Empty in lbs
$X_9$ = Rate of Climb in ft/min (sea level, combat weight and power)
$X_{10}$ = Combat Ceiling in feet
$X_{11}$ = Ferry Range in nautical miles
$X_{12}$ = Sea Level Static Thrust (max) in lbs.

(For detailed methodologies of data determination and term
definitions see (31).)

Each observation was set aside, the models found optimal
under six criteria were computed based on the remaining
22 observations, and the deleted row was predicted. Various
statistics, such as M and MSE for each model, were also
output. The criteria compared were:

1. Minimum MSE
2. Minimum $C_k$
3. Maximum F
4. $R^2$ employing a subjective "elbow rule"
5. MSEP
6. Minimum W (nominal 95% prediction interval).
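The set-aside procedure can be sketched for the minimum W criterion as follows. Several things here are assumptions rather than the study's implementation: W is taken to be the usual nominal 95% prediction interval width, $2\,t\,s\sqrt{1 + x(X'X)^{-1}x'}$; subsets are enumerated exhaustively; the critical values come from a short table of standard two-sided 95% Student t quantiles; and the data are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Two-sided 95% Student t critical values by error degrees of freedom.
T975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def sweep(a, k):
    """Sweep the square matrix `a` on diagonal element k (assumed convention)."""
    n, d = len(a), a[k][k]
    out = [[a[i][j] - a[i][k] * a[k][j] / d for j in range(n)] for i in range(n)]
    for j in range(n):
        out[k][j] = a[k][j] / d
        out[j][k] = -a[j][k] / d
    out[k][k] = 1.0 / d
    return out


def predict_with_width(rows, y, x0):
    """Fit y on `rows` (each row starts with the intercept 1.0) and return
    the prediction at x0 and the nominal 95% prediction interval width."""
    n, p = len(rows), len(x0)
    z = [r + [yi] for r, yi in zip(rows, y)]             # rows of [X, Y]
    m = [[sum(r[i] * r[j] for r in z) for j in range(p + 1)] + [0.0]
         for i in range(p + 1)]
    for i in range(p):                                    # border with x0
        m[i][p + 1] = x0[i]
    m.append(list(x0) + [0.0, -1.0])
    for k in range(p):
        m = sweep(m, k)
    mse = m[p][p] / (n - p)
    y_hat = -m[p][p + 1]
    q = -m[p + 1][p + 1] - 1.0                            # x0 (X'X)^{-1} x0'
    return y_hat, 2.0 * T975[n - p] * sqrt(mse * (1.0 + q))


def min_w_leave_one_out(xs, y):
    """For each set-aside row, choose the predictor subset with smallest W."""
    results = []
    n_vars = len(xs[0])
    for i in range(len(y)):
        keep = [j for j in range(len(y)) if j != i]
        best = None
        for size in range(1, n_vars + 1):
            for subset in combinations(range(n_vars), size):
                rows = [[1.0] + [xs[j][v] for v in subset] for j in keep]
                x0 = [1.0] + [xs[i][v] for v in subset]
                y_hat, w = predict_with_width(rows, [y[j] for j in keep], x0)
                if best is None or w < best[0]:
                    best = (w, subset, y_hat)
        # (selected subset, prediction, prediction error, W)
        results.append((best[1], best[2], y[i] - best[2], best[0]))
    return results


# Hypothetical data: eight observations on three predictors.
xs = [[1.0, 2.0, 0.3], [2.0, 1.0, 1.9], [3.0, 4.0, 0.7], [4.0, 3.0, 2.2],
      [5.0, 7.0, 1.1], [6.0, 5.0, 3.0], [7.0, 9.0, 0.2], [8.0, 6.0, 2.5]]
y = [3.2, 6.1, 7.3, 10.0, 11.4, 14.6, 15.0, 18.4]
results = min_w_leave_one_out(xs, y)
```

Because both the mean square error and the quadratic form at the set-aside point enter W, the selected subset can change from one set-aside observation to the next, which is the behavior the comparison in this chapter examines.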


For the $R^2$ criterion, the optimal model for each subset size
was output. The rule employed for the final selection was:
Use the model with largest $R^2$ whose size is such that the
next largest size does not increase $R^2$ by more than one
percent. Of the six criteria, the maximum F was the most
parsimonious. It selected variable $X_8$ in all cases and it
consistently outperformed the others in terms of the size
of the absolute error of prediction. The minimum W
criterion did, on the average, worse than the others.

The case of the F-111A aircraft is worth mentioning, as it
clearly represents a situation which warrants special
attention. Its non-analogousness to the rest of the data
shows up clearly in every sort of residual analysis.
Historically, each military flight component needed a new
aircraft. The Air Force needed a new interceptor, the Navy
wanted a carrier launched attack aircraft and the Marines
required an aircraft capable of ground support missions.
The then Secretary of Defense, Robert McNamara, decided to
have one aircraft built that would meet all requirements,
thereby achieving tremendous savings. The tri-service
design resulted in the F-111A which, at the time, was the
most sophisticated, fastest, heaviest and costliest single
engine jet ever built. Its design included a radical wing
which could swing forward or backward depending on desired
flight characteristics. (Incidentally, the F-111A
experienced all kinds of technical problems, was not well received
by the three services and is often referred to as
"McNamara's Second Edsel".) Its cost was underestimated
dramatically by all models. Due to the fact that its weight
does not conform to its other characteristics (as conformity
is defined by the other aircraft), the Mahalanobis distance
associated with any model which included weight as a
predictor was excessively large. For this reason, weight was
not included in the minimum W model, in spite of its general
importance (it showed up in every model that was encountered).
The error of prediction associated with this model was far
above that of every other model. However, when weight was
forced into the regression, the new minimum W model clearly
outperformed all others. There was only a 3% increase in W,
which was still much smaller than the W's of models selected
by other criteria. This point will be discussed again in
what follows.

All criteria performed poorly on the untransformed data
in comparison to models suggested by others (Columbia
Research Corporation (31), Clemson University graduate
student projects in Math 805 [unpublished]) after a
complete analysis. This suggested the need for further
investigation of some of the models for signs of misspecification.
The residuals were analyzed for the models which were
encountered most frequently. In all cases, there were clear
indications of gross violations of the assumptions on the
errors. The residuals exhibited clear patterns when
plotted against the X's and against the sorted fitted
values. Plots of Y versus individual X's showed lack of
linearity which, although not necessary for the linearity
of the multi-variable model, tended to confirm information
in other plots. The scatter of Y versus $X_8$, though, was more
or less linear. These clear indications of model
misspecification tend to explain the better performance of the most
parsimonious criterion and the poor showing of the minimum
W criterion. As mentioned earlier, the maximum F criterion
always selected variable $X_8$, which is both a significant
variable and linearly related to Y. The poor performance
of the minimum W criterion is explained as follows:

This criterion is based on a statistic (prediction
interval width) the proper interpretation of which is based
on the usual assumptions on the errors. When those
assumptions are grossly violated, the Mahalanobis distance may no
longer be a reliable measure of analogy. This is the second
point to be considered when this criterion is employed.
Appropriate transformations on the variables must be
performed before the selection is made. Also, after the
selection, an analysis of the resulting residuals is
necessary in order to validate the assumptions for the selected
models. Only models which seem to satisfy those assumptions
should be compared by means of this criterion.

The problem of appropriate transformations is not a
simple one. Theory and common sense may suggest answers to
this question, but in unstructured situations there are many
possibilities and a thorough investigation is usually called
for. As a step in making the problem feasible, all variables were
transformed to their natural logarithms. This
transformation is often considered appropriate for cost data (33)
and, in fact, was employed in all the previously cited
competing models (31). The same preliminary plots and
residual analyses alluded to above indicated that models
selected after this transformation did not exhibit signs
of gross misspecification. The same selection procedure
was used. The minimum W criterion was based on the
logarithmic units. Point predictions and prediction errors in
the original units were also calculated and compared. This
was done by applying the exponential transformation to the
point predictions. The fact that this approach is known to
produce biased estimates of the conditional mean (14) should
not affect the comparative value of the estimates.
Transformation of prediction intervals into the original units
is not easily defined so as to make comparisons meaningful.
This was not needed and was not attempted. Table II shows,
for each aircraft and each model, the errors of prediction
in both units. For each aircraft and for each model
different from the minimum W model, the "gain" was also
calculated, defined as the absolute error of that model minus
the absolute error of the minimum W model. Gains are shown
in Table III. (The entries of
Tables II and III have been rounded to two decimal places
because of space considerations.) A positive gain indicates
a better prediction by the minimum W model. A single zero
indicates that the model selected coincided with the
minimum W model. The gains are shown for both original and
logarithmic units. An asterisk in Tables II and III
indicates that the aircraft under prediction was at a large
Mahalanobis distance along the variables selected by the
corresponding criterion. For each criterion, Table IV
shows the average absolute error and the average percent
absolute error in logarithmic units. The average percent
absolute error is the average absolute error as a percentage
of the observed response value. The average percent gain
is defined similarly and it is also shown, together with
the sum of gains, in Table IV for the five other criteria.
Table V shows the corresponding statistics in the original
units. From these two tables, it can be seen that the
minimum W criterion outperformed all others on all counts on the
average over the 23 observations. The minimum $C_k$ criterion
did comparably well and, as can be observed from Table III,
it selected the same model as the minimum W criterion more
often than any other. In view of its relation to the problem
of prediction discussed in Chapter II, this should not be
surprising. Notice also that this criterion did better
than the minimum W criterion more often than not, although
the difference in some cases was very small. The F-104A
aircraft was predicted better by the minimum W model in the
original units. The performance of the $R^2$ criterion was
also comparable to that of the minimum W criterion; however,
a definite statement cannot be made due to the fact that
the $R^2$ model is not objectively defined. In fact, if a more
parsimonious rule had been employed, the performance of this
criterion would have deteriorated.

TABLE II. Absolute Errors in Logarithmic and Original Units

TABLE III. Gain in Logarithmic and Original Units

TABLE IV. Performance Statistics in Logarithmic Units

                 Min MSE   Min C_k   Max F    R^2     MSEP    Min W
Average |err|     0.385     0.262    0.456    0.266   0.356   0.253
Avrg. % |err|     17.33     14.43    22.70    14.50   16.78   12.15
Total Gain         3.03      0.21     4.68     0.31    2.37
Avrg. % Gain       5.18      2.28    10.55     2.34    4.62

TABLE V. Performance Statistics in Original Units

                 Min MSE   Min C_k   Max F    R^2     MSEP    Min W
Average |err|     8.636     4.308    7.838    4.459   7.366   4.195
Avrg. % |err|     42.13     33.55    54.33    33.84   39.66   25.15
Total Gain       102.15      2.59    83.78     6.06   72.92
Avrg. % Gain      17.15      8.40    29.18     8.69   14.50

TABLE VI. Average Nominal 95% Prediction Interval Widths
and Frequency of Coverage in Logarithmic Units

                 Min MSE   Min C_k   Max F    R^2     MSEP    Min W
Avrg. Width       1.385     1.250    1.778    1.287   2.288   1.223
Coverage (%)      86.96     91.30    91.30    91.30   95.65   95.65
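The gain statistic described above can be sketched as follows. The per-observation errors are hypothetical; the last line applies the same arithmetic to the rounded averages reported in Table IV for minimum MSE and minimum W, assuming that the total gain is the sum of the per-observation gains over the 23 aircraft.

```python
def gains(abs_err_other, abs_err_min_w):
    """Per-observation gain: |error of other model| - |error of min W model|."""
    return [o - w for o, w in zip(abs_err_other, abs_err_min_w)]


def mean(values):
    return sum(values) / len(values)


# Hypothetical absolute prediction errors for four set-aside observations.
other = [0.40, 0.10, 0.25, 0.30]
min_w = [0.20, 0.15, 0.25, 0.10]

g = gains(other, min_w)      # positive entries favor the minimum W model
total_gain = sum(g)

# The total gain equals n times the difference of the average absolute errors.
identity_gap = total_gain - len(g) * (mean(other) - mean(min_w))

# Same arithmetic on the rounded Table IV averages (23 observations);
# the table reports a total gain of 3.03 for minimum MSE.
table_iv_check = 23 * (0.385 - 0.253)
```

A zero entry in `g` corresponds to the case in which both criteria give the same absolute error, for example when they select the same model.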

As was mentioned earlier, the minimum W statistic will
tend to underestimate the width of the true 95% prediction
interval. The average interval width was calculated for each
criterion, as well as the percentage of observations which were
actually covered by the corresponding prediction intervals. These
are shown in Table VI. It is a pleasant surprise that, in
spite of the observation above, the prediction intervals
associated with the minimum W models covered the observed
responses at least as often as those of any other criterion.
The occurrence of this phenomenon in the problem at hand may
not provide firm ground on which a claim that it will happen
in general can be based. Nevertheless, it seems that the
prediction intervals associated with the minimum W models are
well centered about the expected value of y. This is a very
desirable property.
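The coverage frequency discussed above is a simple count. The sketch below, with hypothetical observations and a hypothetical function name, computes the percentage of observed responses that fall inside a symmetric interval of width W around the prediction. It also shows that, with 23 set-aside aircraft, the percentages in Table VI correspond to 20, 21 and 22 covered observations.

```python
def coverage(y_obs, y_hat, widths):
    """Percentage of observed responses inside y_hat +/- width / 2."""
    hits = sum(1 for y, f, w in zip(y_obs, y_hat, widths)
               if abs(y - f) <= w / 2.0)
    return 100.0 * hits / len(y_obs)


# Hypothetical values: the third observation falls outside its interval.
pct = coverage([1.0, 2.0, 3.0, 4.0], [1.1, 1.8, 4.0, 4.2], [0.6, 0.6, 0.6, 0.6])

# Percent coverage for 20, 21 and 22 covered observations out of 23.
table_vi = {k: round(100.0 * k / 23, 2) for k in (20, 21, 22)}
```

With only 23 observations, adjacent coverage figures differ by a single aircraft, which is one reason for caution in generalizing from them.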

Referring to Table III, the observations which the
minimum W model failed to predict well were examined further.
This was done in an effort to identify common features
which might serve as a warning in a careful analysis. The
F9F-8 is the only case in which the W criterion was
definitely outperformed by four of the other criteria. The
model selected consisted of variables $X_2$, $X_6$, $X_8$ and $X_9$. This
model was never selected for any other observation by any
criterion. More importantly, variable $X_{12}$ was conspicuously
absent. This variable was present in all the models with
the smaller prediction errors for each observation, and it
would seem desirable to force it into the final predictive
equation. When this was done, i.e., when the model with
smallest W among those containing this variable was
selected, the resulting errors of prediction in logarithmic
and in original units were 0.056 and 0.259 respectively, a
definite improvement. The new model, although it
represented a 30% increase in W, still had a smaller W than any
of the models selected by the other criteria.

In the case of the F-104A the situation is not as clear.
The minimum W criterion again failed to select variable $X_{12}$.
However, its error in logarithmic units was not much greater
than the errors of the minimum $C_k$, maximum F and $R^2$
criteria which did better and, in fact, in original units the
error was smaller than that of the minimum $C_k$ and $R^2$ models.
A closer investigation revealed that the Mahalanobis
distance of this observation along the variables in the minimum
MSE, minimum $C_k$ and $R^2$ models was very large. In contrast,
in the case of the F9F-8, the reduction in the Mahalanobis
distance attained by the omission of variable $X_{12}$ was not
as dramatic. When this variable was forced into the
minimum W model for the F-104A, the logarithmic error was the
smallest observed (-0.35), although the error in the
original units became slightly larger (-6.68).

The above observations suggest the need for a careful
examination of such cases. It is as important that
extrapolation be avoided, as it is that important variables
be retained. The way in which these two considerations
should be balanced against each other is a matter of
judgment on the part of the analyst. The minimum W criterion
has the potential for calling attention to such issues.

The case of the F-80 aircraft is the one clearly
favoring this criterion. Considerable reduction in the
Mahalanobis distance was attained, without worsening the
fit, by the simple switching of certain variables while
retaining the important ones, namely $X_8$ and $X_{12}$. (As in
the case of the untransformed data, variable $X_8$ was
contained in all of the better models. It was always
selected by all criteria except MSEP.)

The models selected by each criterion and the
corresponding observations are given in Tables VII through XII.
For each criterion, the models selected were ranked
according to their frequency of occurrence and their ranks
were used for labeling purposes. The models selected by
the MSEP criterion are all marked by [illegible]. Observe that
twenty different models were selected by this criterion,
none of which was ever selected by any other criterion, due
to the fact that they were all associated with small $R^2$
values. In view of the fact (mentioned in Chapter II) that
this criterion ignores the MSE (and every other measure of
fit) of the postulated submodels, this should not be
surprising. Observe also the large average prediction
interval.
TABLE VII. Minimum W Models and Aircraft Predicted

A/C: F-80, FH-1, F2H-1, F7U-1, F-84E, F3D-1, F-86H, F9F-8, F4D-1,
F3H-1N, F-102A, F-100D, FJ-4, F-104A, F11F-1, F-105B, F-101C,
F-106B, F-4B, F-5A, F-4J, F-111A, F-8E
TABLE VIII. Minimum MSE Models and Aircraft Predicted

A/C: F-80, FH-1, F2H-1, F7U-1, F-84E, F3D-1, F-86H, F9F-8, F4D-1,
F3H-1N, F-102A, F-100D, FJ-4, F-104A, F11F-1, F-105B, F-101C,
F-106B, F-4B, F-5A, F-4J, F-111A, F-8E
TABLE IX. Minimum $C_k$ Models and Aircraft Predicted

A/C       Model
F-80      3
FH-1      1
F2H-1     4
F7U-1     1
F-84E     1
F3D-1     1
F-86H     1
F9F-8     1
F4D-1     1
F3H-1N    1
F-102A    2
F-100D    1
FJ-4      1
F-104A    1
F11F-1    1
F-105B    1
F-101C    1
F-106B    1
F-4B      1
F-5A      2
F-4J      1
F-111A    1
F-8E      1

[The model label is entered under each variable contained in the
model; the assignment of entries to the variable columns is not
legible in the source.]
TABLE X. Maximum F Models and Aircraft Predicted

A/C       Model
F-80      2
FH-1      1
F2H-1     3
F7U-1     1
F-84E     1
F3D-1     1
F-86H     1
F9F-8     1
F4D-1     1
F3H-1N    1
F-102A    1
F-100D    1
FJ-4      1
F-104A    1
F11F-1    1
F-105B    1
F-101C    1
F-106B    1
F-4B      1
F-5A      1
F-4J      1
F-111A    1
F-8E      1

[The model label is entered under each variable contained in the
model; the assignment of entries to the variable columns
($X_7$ through $X_{12}$ are legible in the header) is not legible in
the source.]
TABLE XI. $R^2$ Models and Aircraft Predicted

A/C       Model
F-80      3
FH-1      1
F2H-1     1
F7U-1     1
F-84E     1
F3D-1     1
F-86H     1
F9F-8     1
F4D-1     1
F3H-1N    1
F-102A    1
F-100D    1
FJ-4      1
F-104A    1
F11F-1    1
F-105B    1
F-101C    1
F-106B    1
F-4B      1
F-5A      2
F-4J      1
F-111A    2
F-8E      1

[The model label is entered under each variable contained in the
model; the assignment of entries to the variable columns
($X_1$ through $X_6$ are legible in the header) is not legible in
the source.]
TABLE XII. MSEP Models and Aircraft Predicted

A/C: F-80, FH-1, F2H-1, F7U-1, F-84E, F3D-1, F-86H, F9F-8, F4D-1,
F3H-1N, F-102A, F-100D, FJ-4, F-104A, F11F-1, F-105B, F-101C,
F-106B, F-4B, F-5A, F-4J, F-111A, F-8E


With respect to the minimum W models, it seems that
most early and later aircraft were predicted by one model,
while most of the ones in the middle of the time scale
were predicted by another. Three aircraft required their
own models. The F2H-1 was not predicted well by any
criterion and it was the one observation whose deletion from
the data base reduced dramatically the widths of the
prediction intervals associated with the minimum W model for
all but one of the other observations. The studentized
residual (discussed in Chapter V) which was associated with
this aircraft was very large (4.2 on the average) for each
and every aircraft under prediction. The F9F-8 was selected
for deletion in the one remaining case. (In general, the
performance of the minimum W criterion improved when one
observation was deleted for each prediction. A detailed
exposition is not given since the deletion of observations
is not advocated in this dissertation.) The one observation
whose deletion reduced W the most in each case, and the
percent reduction in the width of the "prediction interval"
are shown in Table XIII. The new "prediction intervals"
contained the observed y's only 86.96% of the time.

Finally, it should be emphasized that, the purpose of
this analysis being the gaining of insight into the
relative performance of the minimum W criterion, certain aspects
of the problem (which a complete analysis should not fail
to consider) were not stressed. Data determination and
model form specification, for instance, received only
secondary attention.
TABLE XIII. Observation Deleted and Maximum Reduction in
Width of the Prediction Interval for Each Aircraft

A/C       A/C Deleted   Reduction (%)
F-80      F2H-1         33.23
FH-1      F2H-1         31.06
F2H-1     F-102A        26.01
F7U-1     F2H-1         25.20
F-84E     F2H-1         25.17
F3D-1     F2H-1         33.22
F-86H     F2H-1         16.39
F4D-1     F2H-1         25.48
F3H-1N    F2H-1         25.59
F-102A    F2H-1         25.51
F-100D    F2H-1         31.31
FJ-4      F2H-1         29.53
F-104A    F9F-8          7.37
F11F-1    F2H-1         26.08
F-105B    F2H-1         28.79
F-101C    F2H-1         25.20
F-106B    F2H-1         25.49
F-5A      F2H-1         39.37
F-4J      F2H-1         32.78
F-111A    F2H-1         29.64
F-8E      F2H-1         31.78
Other data sets were also analyzed in less detail. A
limited simulation study was conducted, in which the
correlation matrix, the number of variables and the number of
observations were allowed to vary. Although a complete
investigation would be a large project and was not attempted,
the observations made during these studies seem to support
the ones made on the aircraft data. In the absence of
variables which appeared fundamental to the predictive
equation, the minimum W criterion consistently outperformed
the other criteria whenever it succeeded in reducing a
large Mahalanobis distance. This was more pronounced in
the cases where the correlation structure involved high
multicollinearity. In these last cases, large Mahalanobis
distances were frequently reduced significantly by the
exclusion of variables which caused the multicollinearity.
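The effect of multicollinearity on the distance of a prediction point can be seen in a small sketch. It computes the quadratic form $x(X'X)^{-1}x'$, taken here as the quantity that drives the Mahalanobis distance, for two nearly collinear centered predictors and for one of them alone. The data, the point and the function name are hypothetical.

```python
def quad_form_2(x1, x2, point):
    """point (X'X)^{-1} point' for two centered predictors (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    det = s11 * s22 - s12 * s12
    u, v = point
    # Closed-form inverse of the 2 x 2 cross-product matrix.
    return (s22 * u * u - 2.0 * s12 * u * v + s11 * v * v) / det


# Hypothetical centered predictors that are almost perfectly correlated.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.1, 0.9, 2.0]

# A point that breaks the pattern: x1 above its mean, x2 below its mean.
point = (1.0, -1.0)

q_both = quad_form_2(x1, x2, point)                  # both predictors used
q_x1_only = point[0] ** 2 / sum(a * a for a in x1)    # x2 excluded
```

In this sketch the quadratic form is roughly a thousand times larger when both predictors are used, because the point lies in a direction along which the historical data show almost no variation. Dropping one of the collinear variables removes that direction.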

Although in no way conclusive, it may be of interest
to note that the minimum W models performed better (along
the same lines discussed above) than models suggested by
other investigators after careful analyses on the aircraft
data. This in no way means that mechanical selection of
variables is preferable to a careful investigation. The
minimum W criterion should be used as part of a complete
analysis. As is the case with virtually every data
analytic technique, pedestrian application can result in
curious and misleading conclusions. There is no substitute
for a careful, reasoned analysis.
CHAPTER VII


DISCUSSION AND CONCLUSIONS

This investigation has been concerned with the problem
of predicting the response at a known point in the space of
the explanatory variables in the context of multiple linear
regression. This is the problem with which parametric cost
estimation is concerned and it is encountered frequently in
other applications. The view taken in this dissertation is
that the issue of analogy of the point under prediction to
the historical data should not, in such cases, be ignored.

The  Mahalanobis  distance  has  been  studied  as  an  appro¬ 
priate  measure  of  analog  in  higher  dimensions.  The  width  of 
the  prediction  interval  is  a  numeraire  which  combines  this 
measure  of  analog  with  the  mean  square  error,  which  is  a 
standard  measure  of  the  fit  provided  by  a  given  model. 

Thus,  the  prediction  interval  offers  itself  as  a  tool  with
which  variables  can  be  screened  and  models  brought  to  the
foreground  that  are  reasonable  candidates  for  the  purpose  of
such  analyses.  The  experience  gained  by  applying  this
methodology  to  real  and  simulated  data  sets  suggests  that  the
careful  analyst  should  benefit  from  its  use  by  gaining
insight  into  an  aspect  of  the  problem  which  otherwise  would
not  have  been  brought  into  focus.  As  was  mentioned  earlier,
the  careful  analyst  would  certainly  become  skeptical  about  a
model  which,  although  otherwise  reasonable,  produced  a  very
large  prediction  interval  at  the  point  under  consideration.
However,  the  W  criterion  has  the  advantage  of  providing  a 
clear  and  unconfused  warning  by  taking  the  issue  of  analogy 
into  consideration  during  the  screening  process.  The  W  cri¬ 
terion,  just  as  any  other,  should  not  be  used  in  a  pedes¬ 
trian  way  as  a  method  for  pointing  to  "the  one  best  model". 
The  notion  of  a  model  which  is  best  for  all  purposes  is  not 
defined  in  unstructured  situations  which  comprise  the  bulk 
of  empirical  model  building.  Even  for  a  specific  application,
a  claim  about  the  knowledge  of  such  a  model  cannot  be
defended  on  incontestable  grounds.  Therefore,  selection
criteria  ought  to  be  viewed  as  screening  aids  and  used  as 
such.  This  point,  although  generally  accepted,  is  all  too 
frequently  forgotten  in  practice. 

There  is  a  second  point  on  which  the  W  criterion  may 
prove  to  be  a  valuable  aid.  The  exclusion  of  variables 
which  are  known  to  be  important  from  other  considerations 
ought  to  be  taken  as  a  warning  about  the  peculiarity  of  the 
point  under  prediction.  Selection  criteria  may  occasionally 
point  to  models  which  the  analyst  finds  unacceptable  either 
because  of  the  variables  which  they  contain  (or  exclude)  or 
because  of  the  fact  that  the  underlying  model  assumptions 
seem  to  be  violated.  In  such  cases,  the  usual  statistics 
lose  their  validity  and  risks  attendant  on  the  use  of  a
suspect  model  are  introduced.  If  no  reasonable  model  can  be
found  which  passes  model  aptness  considerations,  the  regression
approach  to  the  problem  based  on  the  given  body  of  data
should  be  questioned.

In  parametric  cost  estimation  where  the  cost  of  an 
object  system  must  be  predicted  based  on  historical  data  on 
"similar"  systems  built  in  the  past,  the  notion  of  analogy 
often  presents  itself  conspicuously.  It  is  conceivable  that 
the  proposed  system  may  reflect,  in  the  values  of  the  expla¬ 
natory  variables  associated  with  it,  technological  develop¬ 
ments  and/or  performance  characteristics  different  from 
those  encountered  in  the  historical  data  along  certain 
dimensions.  A  new  weapon  system,  for  instance,  would  proba¬ 
bly  not  even  be  considered  if  such  were  not  the  case.  A 
model  which  fails  to  explain  the  historical  data  adequately 
would  be  unreasonable  to  use  for  the  prediction  of  the  new 
system.  It  seems  equally  unreasonable,  however,  to  devote
all  effort  to  fitting  the  historical  data,  disregarding
the  relation  of  the  proposed  system  to  them.  The  W  crite¬ 
rion  can  (and  should)  be  employed,  together  with  other  con¬ 
siderations,  so  that  both  aspects  of  the  problem  will  be 
given  deserved  attention  if  the  final  model  is  not  to  be 
grossly  myopic. 

Models  which  are  found  good  by  more  than  one  criterion
are  highly  desirable.  The  W  criterion  can  be  employed  to 
suggest  several  models  which  can  then  be  compared  with  those 
suggested  by  other  criteria.  This  procedure  will  focus  the
attention  of  the  analyst  on  a  (hopefully)  small  number  of
models  which  must  be  carefully  scrutinized  before  a  final 
choice  is  made. 
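
The  screening  step  described  above  can  be  sketched  as  follows.  This  is  a  minimal  illustration  under  stated  assumptions,  not  the  algorithm  developed  in  the  earlier  chapters:  the  function  names  `relative_width`  and  `screen_subsets`  are  hypothetical,  the  data  are  synthetic,  every  model  includes  an  intercept,  and  all  k-subsets  are  enumerated  by  brute  force.  Because  every  subset  of  the  same  size  k  shares  the  same  residual  degrees  of  freedom,  the  t  multiplier  is  common  to  all  of  them,  so  ranking  by  s*sqrt(1 + x0'(X'X)^{-1}x0)  should  give  the  same  ordering  as  ranking  by  the  interval  width  itself.

```python
import itertools

import numpy as np


def relative_width(X, y, x0, cols):
    """Return s * sqrt(1 + x0'(A'A)^{-1} x0) for the model on columns `cols`.

    A is the design matrix with an intercept column prepended. The value is
    the prediction-interval width up to the factor 2*t, which is the same for
    all subsets of a given size (assumed standard form of the interval).
    """
    cols = list(cols)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X[:, cols]])
    a0 = np.concatenate([[1.0], x0[cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                      # n - k - 1
    s2 = resid @ resid / dof                  # mean square error
    h0 = a0 @ np.linalg.solve(A.T @ A, a0)    # leverage of the new point
    return float(np.sqrt(s2 * (1.0 + h0)))


def screen_subsets(X, y, x0, k, keep=3):
    """Rank all k-subsets by relative width at x0 and keep the best few."""
    p = X.shape[1]
    scored = [
        (relative_width(X, y, x0, c), c)
        for c in itertools.combinations(range(p), k)
    ]
    scored.sort()
    return scored[:keep]


# Synthetic illustration (assumed data, not the aircraft data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=30)
x0 = np.array([1.0, 1.0, 1.0, 1.0])

for width, subset in screen_subsets(X, y, x0, k=2):
    print(subset, round(width, 4))
```

The  short  list  returned  by  such  a  routine  is  only  a  set  of  candidates;  it  would  then  be  compared  with  the  models  suggested  by  other  criteria  and  examined  for  model  aptness  before  a  choice  is  made.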

The  distributional  properties  of  W  under  selection  pose 
a  highly  complex  problem  which  has  not  been  investigated  in 
this  dissertation.  Another  problem  which  has  not  been  con¬ 
sidered  is  the  following:  The  statistic  W  is  expressed  in 
the  units  of  the  response  variable  used  in  the  selection 
process.  Often,  various  transformations  on  the  response
variable  are  considered  in  the  same  problem.  How  is  one  to 
compare  the  Ws  associated  with  models  based  on  different 
transformations  on  the  response  variable?  This  is  a  ques¬ 
tion  which  can  only  be  answered  on  a  case  by  case  basis.  In 
some  cases  it  may  be  a  simple  mathematical  problem,  while  in 
others  it  may  defy  an  objective  definition. 
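
One  instance  of  the  simple  case  can  be  sketched  as  follows,  assuming  (for  illustration  only)  that  the  response  was  modeled  as  log y.  Because  the  logarithm  is  monotone,  a  prediction  interval  [a, b]  on  the  log  scale  maps  to  [e^a, e^b]  in  the  original  units  with  the  same  coverage,  so  its  width  can  be  compared  with  that  of  an  untransformed  model.  Here  W_log = b - a  and  c = (a + b)/2  is  the  predicted  value  on  the  log  scale;  these  symbols  are  not  taken  from  the  text.

```latex
W_{y} \;=\; e^{b} - e^{a}
      \;=\; e^{\,c + W_{\log}/2} - e^{\,c - W_{\log}/2}
      \;=\; 2\, e^{c} \sinh\!\left(\frac{W_{\log}}{2}\right).
```

The  back-transformed  width  depends  on  the  predicted  level  c  as  well  as  on  W_log,  which  is  one  reason  the  comparison  cannot  be  made  on  the  transformed  scale  alone.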

Finally,  although  an  extensive  simulation  study  was  not 
conducted,  such  a  study  may  be  a  worthwhile  endeavor  that 
can  provide  useful  insight  into  the  questions  raised  in  this 
investigation.